snowplow-referer-parser / referer-parser
Library for extracting marketing attribution data from referrer URLs
Home Page: http://snowplowanalytics.com
Basically, the v1 library only looked for exact matches, so www.google.com and google.com were treated as different referers.
That approach is flawed.
So this version instead tries multiple lookups, falling back from the full host to its parent domains.
We should write some explicit tests into the Specs2 test suite to check that this recursion logic works correctly.
/cc @donspaulding
Architecture is very OO and convoluted.
Move towards similar architecture to the Scala version: a Parser module which includes one-time instantiation, and then a static parse() method which returns a Referer object for a given URL.
@Tombar has moved it in the right direction with a public parse() method (Tombar@a280d9e).
Unlike Ruby and Python, Java and Scala don't expose the URI. We should expose it.
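For illustration, a minimal Python sketch of that target shape; the class names and lookup data here are hypothetical stand-ins, not the library's actual API:

```python
from urllib.parse import urlparse

class Referer:
    """Result object for one parsed referer; exposes the URI too."""
    def __init__(self, uri, known, source=None):
        self.uri = uri
        self.known = known
        self.source = source

class Parser:
    # One-time instantiation: the lookup database is built once when the
    # module is loaded, not on every parse() call. Stand-in data only.
    _referers = {"google.com": "Google", "bing.com": "Bing"}

    @staticmethod
    def parse(url):
        """Static entry point returning a Referer for the given URL."""
        uri = urlparse(url)
        source = Parser._referers.get(uri.hostname)
        return Referer(uri, known=source is not None, source=source)
```

With this shape, Parser.parse("http://google.com/search?q=x") yields a Referer with known set and the URI exposed, as requested above.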
I'm seeing some strange Yahoo data showing up as "search".
A client ran a homepage takeover on Yahoo a few weeks back, which sent a lot of traffic from the Yahoo homepage with hostname "au.yahoo.com".
I know this isn't search traffic, so when I queried it in Snowplow, this hostname had no real keywords.
In contrast, the hostname "au.search.yahoo.com" had quite a few search terms.
Is this a case of not provided?
Hi guys. Would you be so kind as to update the npm module? It's a bit outdated ('0.0.2': '2013-08-16T20:29:36.351Z').
The main things I'm thinking here are:
/cc @yalisassoon @kingo55, see also snowplow/snowplow#130
Got this rare referrer:
http://www.dogpile.com/info.dogpl/search/web?fcoid=417&fcop=topnav&fpid=27&q=HELP+speaker+dock+not+working+since+6.1.2+update+avforums&ql=
Although I can see this domain in the installed 'referer_parser-0.1.0-py2.7.egg/referer_parser/data/search.json', the library does not detect it:
In [21]: r.known
Out[21]: False
Can you check with your updated copy? Cheers
referer_tests.json now has two new tests for custom internal domains functionality:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/resources/referer-tests.json#L235-L248
The Referer-Parser is configured with a list of domains which should be counted as internal:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/java-scala/src/test/scala/com/snowplowanalytics/refererparser/scala/JsonParseTest.scala#L41
The PHP version of the Referer-Parser already has support for internal hosts, so it should be possible to get it working with the new tests.
When done, please update sync_data.py to automatically copy the master copy of referer-tests.json into the PHP subfolder: https://github.com/snowplow/referer-parser/blob/23e3fd9f3bfaa8947fcb456ed8fbdb22f271dabc/sync_data.py#L58
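A minimal sketch of the internal-domains check, assuming a simple host-membership test; the function name and example domains are illustrative, not the actual test fixtures:

```python
from urllib.parse import urlparse

def classify_medium(referer_url, internal_domains):
    """Return "internal" when the referer host is one of the configured
    internal domains; otherwise fall through to "unknown". A real parser
    would go on to check search/social/email databases on the miss path."""
    host = urlparse(referer_url).hostname
    if host in internal_domains:
        return "internal"
    return "unknown"
```

For example, classify_medium("http://intranet.example.com/home", {"intranet.example.com"}) classifies the referer as internal.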
Usage information is incomplete compared with the other examples.
referer-parser could take a cue from ua-parser by adding a YAML (or JSON) file with test cases, consisting of example referer URLs for each of the referers in search.yml and the corresponding parse results.
This would greatly increase the confidence level in the consistency of future ports to other programming languages.
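One possible shape for such a file and its runner, sketched in Python; the field names below are assumptions for illustration, not an actual schema:

```python
import json

# Illustrative test cases: each pairs a referer URL with expected results.
CASES = json.loads("""
[
  {"uri": "http://www.google.com/search?q=snowplow",
   "source": "Google", "term": "snowplow"}
]
""")

def run_cases(parse, cases):
    """Run each case through a parse(url) -> (source, term) function and
    return the URIs of any cases whose results don't match expectations."""
    failures = []
    for case in cases:
        source, term = parse(case["uri"])
        if (source, term) != (case["source"], case["term"]):
            failures.append(case["uri"])
    return failures
```

Each port would only need a thin adapter from its own parse API to the (source, term) tuple to share the same case file.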
Suggestion from Peter O'Neill based on the blog post A new method to track keyword ranking using Google Analytics.
On occasion, Google search exposes the position of the keyword that drove the click to your website in the page_referrer as a cd= parameter. We should extract this from the referrer_url so that it can be stored in SnowPlow and used to track keyword rankings.
In order to implement this, we'll need to extract the parameter during referer parsing.
Then in SnowPlow we'll need an additional mkt_xxx field to store the result, e.g. mkt_rank.
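The extraction step could look something like this (a sketch assuming a plain querystring parse suffices; the helper name is illustrative):

```python
from urllib.parse import urlparse, parse_qs

def extract_rank(referer_url):
    """Pull Google's cd= parameter (the clicked result's position) out of
    a referer URL, returning an int, or None if absent or malformed."""
    values = parse_qs(urlparse(referer_url).query).get("cd")
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None
```

The ValueError guard matters because cd= occasionally carries non-numeric junk in the wild.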
I got the error:
from referer_parser import Referer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 32, in <module>
    REFERERS = load_referers(JSON_FILE)
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 21, in load_referers
    params = list(map(text_type.lower, config['parameters']))
TypeError: descriptor 'lower' requires a 'unicode' object but received a 'str'
I fixed it temporarily by modifying the load_referers function in __init__.py, converting each parameter to unicode before lowercasing:
if 'parameters' in config:
    p = config['parameters']
    if text_type == unicode:
        # ensure each parameter is unicode before text_type.lower is applied
        p = [unicode(x) for x in p]
    params = list(map(text_type.lower, p))
Python 2.7, OS: Windows 7
Hi, all, please see this referer:
http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images
I've added the parameters as_q, as_epq and as_eq to my local referers.yml, and ran:
irb(main):002:0> require 'referer-parser'
=> true
irb(main):003:0> ref = "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
=> "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
irb(main):004:0> st = RefererParser::Referer.new(ref, 'referers.yml')
=> #<RefererParser::Referer:0x25feb227 @search_term="", @known=true, @referer="Google", @uri=#<URI::HTTP:0x746231ed URL:http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images>, @search_parameter="as_q">
irb(main):005:0> st.search_term
=> ""
I think that "carbonite offer code" is the better result.
https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referer.rb#L85
If the referer contains two or more search parameters, I would prefer to return the first search term that is not nil or empty, rather than the value of whichever parameter happens to match first.
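Sketched in Python as a stand-in for the Ruby line linked above (the function name is illustrative), the preferred behaviour would be:

```python
from urllib.parse import urlparse, parse_qs

def search_term(referer_url, parameters):
    """Return the first non-empty value among the configured search
    parameters, instead of stopping at the first parameter present."""
    qs = parse_qs(urlparse(referer_url).query, keep_blank_values=True)
    for param in parameters:
        for value in qs.get(param, []):
            if value:  # skip nil/empty matches such as as_q=
                return value
    return None
```

With the referer from this thread, an empty as_q= no longer masks the populated as_epq= value.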
Came across this tonight. I'm interested in the PHP port to parse domains and search terms out of URL logs. The problem is I cannot use Composer with my current server (WHM/cPanel setup)... I came across this situation once before; that project also offered a phar file with everything bundled, which I could use without problems and which eliminated the need for Composer.
Any chance of this happening?
Hi Guys,
It'd be really useful if you could push 0.2.0 to PyPI - currently pip only has 0.1.0 available.
Thanks a lot!
Some search engines operate load balancing etc. on subdomains, leading to referers which can't be found in search.yml. For example, a Yahoo! referer URL might be "us.yhs4.search.yahoo.com".
The most performant way of supporting this is probably this algorithm: look up the full host in search.yml. Found? Finish. If not, strip the leftmost subdomain and look up again, repeating until a match is found or the host is exhausted.
This should be a lot faster than switching to a regexp-based approach.
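A minimal sketch of that lookup loop in Python; the referer table here is a tiny stand-in for the real search.yml contents:

```python
# Stand-in database mapping known referer hosts to sources.
REFERERS = {
    "search.yahoo.com": "Yahoo!",
    "google.com": "Google",
}

def lookup(host, referers=REFERERS):
    """Try the full host first; on a miss, strip the leftmost label and
    retry, so "us.yhs4.search.yahoo.com" falls back to "search.yahoo.com"."""
    while host:
        if host in referers:
            return referers[host]
        _, _, host = host.partition(".")  # drop the leftmost subdomain
    return None
```

Each iteration is a single dict/hash lookup, so the cost is bounded by the number of labels in the host, which is why this should beat a regexp scan over the whole database.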
Is there any reason why duckduckgo.com is labelled as DuckDuckGoL?
E.g. extract product/banner ID from affiliate IDs
Basically, a Java URI can be invalid to the point that .getHost() and/or .getPath() fail - so we need to catch any errors here.
Example:
http://bigcommerce%20wordpress%20plugin/
See snowplow/snowplow#314 for further details.
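The same defensive pattern sketched in Python, where e.g. an unmatched "[" in the authority makes urlparse raise ValueError (the helper name is illustrative):

```python
from urllib.parse import urlparse

def safe_host_and_path(referer_url):
    """Extract (host, path), returning None instead of raising when the
    URL is mangled beyond recovery - the analogue of catching
    getHost()/getPath() failures in the Java port."""
    try:
        uri = urlparse(referer_url)  # raises ValueError on e.g. a bad IPv6 authority
        host, path = uri.hostname, uri.path
    except ValueError:
        return None
    if not host:  # no usable host at all -> treat as unparseable
        return None
    return host, path
```

The caller then treats a None result as an unknown referer rather than aborting the whole enrichment run.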
Most search engines use UTF-8, so this is a nice-to-have.
Potentially because the &url= isn't being escaped? (Why isn't it being escaped?)
sa=t&rct=j&q=g+star&source=web&cd=3&ved=0CEEQFjAA&url=http://www.gstars.co.uk/?ito=GAG5362963510&itc=GAC19854885430&itkw=g-stars&itawnw=search&ei=8eMQUt_hAvTSpgLjqQE&usg=AFQjCNFFNpW7yF9pcqCfOpYvqafYS94p_Q
Ticket based on #48.
Believe the bad URI fed in was: http://search.tb.ask.com/search/GGmain.jhtml
@aposashenko could you provide more information (e.g. on error message) and exact URL used here?
A test to ensure that a google.com/product referer returns "Google Product Search", not "Google".
/cc @donspaulding as I wasn't sure that this was handled in the Python version.
The relevant line in the Ruby is:
https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referers.rb#L28
The relevant line in the Java is:
crossScalaVersions := Seq("2.9.2", "2.9.3", "2.10.0", "2.10.1")
And add the new versions into .travis.yml.
Because we never released it to Maven.
Using Argonaut to process the tests.
Can someone with time create a phar version, or submit a pull request of one using Box? This is above my head, but I cannot use the library without a phar... I would be extremely grateful if anyone has the time.
Two reasons for this:
This would involve updating the Java/Scala and Ruby ports, and making @donspaulding's build_json.py a standard part of the distribution for all ports.
Important note: we would still keep the master copy of the database in YAML for readability/editability purposes.
org.junit.ComparisonFailure: Internal subdomain HTTP medium
Expected :internal
Actual :unknown
at org.junit.Assert.assertEquals(Assert.java:125)
at com.snowplowanalytics.refererparser.ParserTest.refererTests(ParserTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Unfortunately the bump to httpclient 4.3.3 has broken referer-parser on Hadoop.
Specifically it's this line of code:
81b88ff#diff-729b6a9a457c5e4a0b244bf130a1e08eR192
This is the error on Hadoop:
Caused by: java.lang.NoSuchMethodError: org.apache.http.client.utils.URLEncodedUtils.parse(Ljava/lang/String;Ljava/nio/charset/Charset;)Ljava/util/List;
at com.snowplowanalytics.refererparser.Parser.extractSearchTerm(Parser.java:205)
at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:154)
at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:116)
at com.snowplowanalytics.refererparser.scala.Parser$.parse(Parser.scala:153)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.RefererParserEnrichment.extractRefererDetails(RefererParserEnrichment.scala:107)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:291)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:277)
at scala.Option.foreach(Option.scala:236)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
at scalaz.Validation$class.foreach(Validation.scala:126)
at scalaz.Success.foreach(Validation.scala:329)
at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$.enrichEvent(EnrichmentManager.scala:277)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
at scalaz.std.OptionFunctions$class.cata(Option.scala:157)
at scalaz.std.option$.cata(Option.scala:209)
at scalaz.syntax.std.OptionOps$class.cata(OptionOps.scala:9)
at scalaz.syntax.std.ToOptionOps$$anon$1.cata(OptionOps.scala:103)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
at scalaz.Validation$class.flatMap(Validation.scala:141)
at scalaz.Success.flatMap(Validation.scala:329)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$.toCanonicalOutput(EtlJob.scala:69)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:170)
at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:169)
at com.twitter.scalding.MapFunction.operate(Operations.scala:58)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
... 11 more
Hadoop bundles an old version of httpclient which doesn't have parse(String, Charset). There is talk of Hadoop removing that dependency, but in any case we and EMR use an oldish version of Hadoop.
The problem is not using 4.3.3 per se, but using parse(String, Charset). /cc @squeed
Composer (http://getcomposer.org/) is currently the fashionable way to manage PHP dependencies. Unfortunately, the composer.json needs to live in the root directory of a repository so that https://packagist.org/ allows publishing it. I see two ways:
Split out php/ as referer-parser-php and publish that repository.
@alexanderdean, what do you think?
Looks like a way of avoiding having to specify all TLDs for a given search engine.
It also seems to work at the beginning of a domain.
https://github.com/piwik/piwik/blob/master/core/DataFiles/SearchEngines.php#L906
https://github.com/piwik/piwik/blob/master/core/DataFiles/SearchEngines.php#L907
Java should be built with Gradle, with no Scala deps anywhere.
Are there any interesting specific querystring fields to extract for social referrers?
See #18
Hi guys,
When I try to parse referers from that site I get an exception. I am using the ASP.NET library.
Can you add the following source to your library?
No need to backfill