veliovgroup / jazeee-meteor-spiderable
Fork of Meteor Spiderable with longer timeout, caching, better server handling
Home Page: https://atmospherejs.com/jazeee/spiderable-longer-timeout
Hey,
I'm having a strange problem with Spiderable. If I call, for example, mysite.com/?_escaped_fragment_=
I get the following error on my server and get redirected to the normal site on the client side:
spiderable: phantomjs failed: { [Error: Command failed: ] killed: true, code: null, signal: 'SIGTERM' }
If I now call mysite.com/blablabla/?_escaped_fragment_=
I get the rendered version of my start page. Any ideas why Spiderable doesn't work on defined routes, but works on undefined routes?
I have a problem with my website: my pages return nothing, with a 204 HTTP status code, when requested with ?_escaped_fragment_=
The 404 page correctly sends the 404 HTTP status code, but is also empty.
if(Meteor.isServer){
Spiderable.debug = true;
Spiderable.customQuery = true;
}
Router.configure({
notFoundTemplate: '_404'
});
Router.plugin('dataNotFound', {notFoundTemplate: '_404'});
Router.route('/', function () {
this.render('main');
});
Router.onAfterAction(function(){
if(this.ready()){
Meteor.isReadyForSpiderable = true;
}
});
console.log('Scripting...');
[test-buom01.rhcloud.com ******]\> mongo $OPENSHIFT_MONGODB_URLtest
MongoDB shell version: 2.4.9
connecting to: 127.*.*.*:27017/admin
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
http://docs.mongodb.org/
Questions? Try the support group
http://groups.google.com/group/mongodb-user
> use test
switched to db test
> db.SpiderableCacheCollection.find({})
{ "_id" : "yLhw4TicXDzEeT8E5", "hash" : "f807c97ecbb6087b674f75cc45e714675b8cdd0b56959f7600c89ebc860f9678", "url" : "http://test-buom01.rhcloud.com/?___isRunningPhantomJS___=true", "headers" : [ { "name" : "Date", "value" : "Mon, 03 Aug 2015 13:41:14 GMT" }, { "name" : "Vary", "value" : "Accept-Encoding" }, { "name" : "Content-Type", "value" : "text/html; charset=utf-8" }, { "name" : "Content-Encoding", "value" : "gzip" }, { "name" : "Keep-Alive", "value" : "timeout=15, max=100" }, { "name" : "Connection", "value" : "Keep-Alive" } ], "content" : "<!DOCTYPE html><html><head>\n <link rel=\"stylesheet\" type=\"text/css\" class=\"__meteor-css__\" href=\"/20ae2c8d51b2507244e598844414ecdec2615ce3.css\">\n\n\n\n\n \n\n\n\n\n<title>spiderable-test</title>\n</head>\n<body>\n\n\n\n<h1>Page d'accueil</h1></body></html>", "status" : 204, "createdAt" : ISODate("2015-08-03T13:41:15.360Z") }
>
{
"_id":"yLhw4TicXDzEeT8E5",
"hash":"f807c97ecbb6087b674f75cc45e714675b8cdd0b56959f7600c89ebc860f9678",
"url":"http://test-buom01.rhcloud.com/?___isRunningPhantomJS___=true",
"headers":[
{
"name":"Date",
"value":"Mon, 03 Aug 2015 13:41:14 GMT"
},
{
"name":"Vary",
"value":"Accept-Encoding"
},
{
"name":"Content-Type",
"value":"text/html; charset=utf-8"
},
{
"name":"Content-Encoding",
"value":"gzip"
},
{
"name":"Keep-Alive",
"value":"timeout=15, max=100"
},
{
"name":"Connection",
"value":"Keep-Alive"
}
],
"content":"<!DOCTYPE html><html><head>\n <link rel=\"stylesheet\" type=\"text/css\" class=\"__meteor-css__\" href=\"/20ae2c8d51b2507244e598844414ecdec2615ce3.css\">\n\n\n\n\n \n\n\n\n\n<title>spiderable-test</title>\n</head>\n<body>\n\n\n\n<h1>Page d'accueil</h1></body></html>",
"status":204,
"createdAt": ISODate("2015-08-03T13:41:15.360 Z")
}
./.compress2.sh
export MINI="./.mini/nodejs"
echo "{{{{BUILDING}}}}"
cd $MINI
echo "Remove output"
chmod -R 777 ./.demeteorized
rm -R ./.demeteorized
echo "Build and demeteorization..."
demeteorizer
echo "Fix permissions again..."
chmod -R 777 ./.demeteorized
cd ./.demeteorized
echo "Adding env vars..."
# settings.json is actually empty
echo "process.env.METEOR_SETTINGS = '$(echo $(cat ../settings.json))';
$(cat ./main.js)" > ./main.js
#sed -i '1i process.env.MAIL_URL = "smtp://*********/";' ./main.js
#sed -i '1i process.env.ROOT_URL = ("http://" + process.env.OPENSHIFT_APP_DNS) || "http://localhost:8000"' ./main.js
#sed -i '1i process.env.MONGO_URL = (process.env.OPENSHIFT_MONGODB_DB_URL + process.env.OPENSHIFT_APP_NAME) || "mongodb://localhost:27017/meteor";' ./main.js
#sed -i '1i process.env.PORT = process.env.OPENSHIFT_NODEJS_PORT || 8000;' ./main.js
#sed -i '1i process.env.BIND_IP = process.env.OPENSHIFT_NODEJS_IP || "127.0.0.1";' ./main.js
#sed -i '1i process.env.MAIL_URL = "smtp://*******/";' ./main.js
sed -i '1i process.env.ROOT_URL = "http://" + (process.env.OPENSHIFT_APP_DNS || "localhost:8000");' ./main.js
#sed -i '1i process.env.ROOT_URL = "http://"+ process.env.OPENSHIFT_NODEJS_IP + ":" + process.env.OPENSHIFT_NODEJS_PORT;' ./main.js
sed -i '1i process.env.MONGO_URL = (process.env.OPENSHIFT_MONGODB_DB_URL + process.env.OPENSHIFT_APP_NAME) || "mongodb://localhost:27017/meteor";' ./main.js
sed -i '1i process.env.PORT = process.env.OPENSHIFT_NODEJS_PORT || 8000;' ./main.js
sed -i '1i process.env.BIND_IP = process.env.OPENSHIFT_NODEJS_IP || "127.0.0.1";' ./main.js
echo "Copying into git directory..."
cd ../../..
rm -R ./.end/*
#mkdir ./.end
#cp -R ./.static-files/.git ./.end/
#cp -R ./.static-files/.openshift ./.end/
cp -R $MINI/.demeteorized/* ./.end/
echo "Done"
echo "{{{{COMPRESSION}}}}"
export MINI="./.mini/nodejs"
echo "Fix permissions..."
chmod -R 777 $MINI
echo "Removing..."
rm -R $MINI
echo "Copying..."
mkdir $MINI
cp -R * $MINI
cp -R ./.meteor $MINI
echo "Minifying..."
#htmlminify -o $MINI/spiderable-test.html $MINI/spiderable-test.html
echo "Remove enters..."
#echo $(cat $MINI/spiderable-test.html)>$MINI/spiderable-test.html
echo "Done"
cd .end
git add .
git commit -m "Update"
git push
http://test-buom01.rhcloud.com/
http://test-buom01.rhcloud.com/?_escaped_fragment_=
Can anybody help me?
What can I do?
Thank you for reading, and sorry for my bad English.
@jazeee Hi,
Sometimes we're experiencing this issue:
Error serving static file Error: Requested Range Not Satisfiable
Error: Meteor code must always run within a Fiber. Try wrapping callbacks that you pass to non-Meteor libraries with Meteor.bindEnvironment.
at Object.Meteor._nodeCodeMustBeInFiber (packages/meteor/dynamics_nodejs.js:9:1)
at [object Object]._.extend.get (packages/meteor/dynamics_nodejs.js:21:1)
at [object Object].RouteController.lookupOption (packages/iron:router/lib/route_controller.js:66:1)
at new Controller.extend.constructor (packages/iron:router/lib/route_controller.js:26:1)
at [object Object].ctor (packages/iron:core/lib/iron_core.js:88:1)
at Function.Router.createController (packages/iron:router/lib/router.js:201:1)
at Function.Router.dispatch (packages/iron:router/lib/router_server.js:39:1)
at Object.router (packages/iron:router/lib/router.js:15:1)
at next (/Users/dmitriygolev/.meteor/packages/webapp/.1.2.0.xohm6p++os+web.browser+web.cordova/npm/node_modules/connect/lib/proto.js:190:15)
at packages/jazeee:spiderable-longer-timeout/spiderable_server.js:148:1
But we cannot figure out how to fix it.
I'm struggling a lot with PhantomJS due to a select: Invalid argument
error (reported 2 years ago and never fixed by the Phantom team) and the advice on getting around this on other web projects is to switch to a newer emulator, like NightmareJS or Electron (which Nightmare is based on).
I'm about to attempt it on a fork of this project, but wondering first what your thoughts are.
My ld+json script does not appear in ?_escaped_fragment_= mode:
<script type="application/ld+json"></script>
Does Spiderable remove all script tags?
I'm using Spiderable with Phantomjs and I need the original request object in order to fetch header properties (in my case I need accept-language header). We could achieve this and allow Meteor to use that property by modifying the spiderable_server.js file in this way:
WebApp.connectHandlers.use(function (req, res, next) {
// _escaped_fragment_ comes from Google's AJAX crawling spec:
// https://developers.google.com/webmasters/ajax-crawling/docs/specification
if (/\?.*_escaped_fragment_=/.test(req.url) ||
_.any(Spiderable.userAgentRegExps, function (re) {
return re.test(req.headers['user-agent']); })) {
Spiderable.originalReq = req; // this is the new property
var url = Spiderable._urlForPhantom(Meteor.absoluteUrl(), req.url);
In this way I can use Spiderable.originalReq
to read from Meteor the original request headers.
Did that all make sense? :)
Let me know if you want me to submit a pull request.
We should wait for subscriptions to complete before rendering.
In IronRouter, we need to wait for this.ready() before setting Meteor.isReadyForSpiderable = true
In other words:
if @ready()
  Meteor.isReadyForSpiderable = true
You should either state in the readme that Spiderable.ignoredRoutes.push(…)
must be in server-only code or provide a dummy value for client code.
Currently I get Uncaught TypeError: Cannot read property 'push' of undefined
on the client. In development mode this doesn't affect overall site behaviour, but it completely breaks the site when deployed on a production server.
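A minimal sketch of the guard described above, assuming only that ignoredRoutes is an array on the server and undefined on the client. The helper name addIgnoredRoutes and the example route are hypothetical and not part of the package.

```javascript
// Hypothetical helper: appends routes only where the ignoredRoutes array exists,
// so shared (client + server) code does not throw on the client.
function addIgnoredRoutes(spiderable, routes) {
  if (!spiderable || !Array.isArray(spiderable.ignoredRoutes)) {
    return false; // e.g. on the client, where ignoredRoutes is undefined
  }
  spiderable.ignoredRoutes.push(...routes);
  return true;
}

// In a Meteor app this might be wired up as follows (assumption):
// if (Meteor.isServer) { addIgnoredRoutes(Spiderable, ['/admin/']); }
```

Wrapping the call in Meteor.isServer, or placing it in a server-only directory, would avoid the client-side TypeError without needing a dummy value.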
Thank you for this wonderful package (!!!)
I noticed a strange error when we deployed this package on the production system running nginx + phusion passenger with enabled gzip encoding.
The header Content-Encoding:gzip
gets correctly set on the cached page in the collection, but
accessing the cached page then returns an ERR_CONTENT_DECODING_FAILED
in Chrome until I manually remove the content-encoding header from the cached document in the db.
As a quick (and working) fix I did the following:
if result.headers?.length > 0
  for header in result.headers
    res.setHeader header.name, header.value if (header.value != 'gzip')
else
  res.setHeader 'Content-Type', 'text/html'
res.writeHead result.status
You may want to take a look at this problem as I did not have enough time for a thorough investigation.
cheers
//s
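The quick fix above skips any header whose value is 'gzip'. A slightly stricter variant, sketched here in JavaScript, filters by header name instead. It assumes the cached headers are stored as { name, value } objects, as in the cached documents shown earlier in this thread. The helper name replayableHeaders is hypothetical.

```javascript
// Hypothetical helper, not part of the package: returns the cached headers that
// are safe to replay when the cached body is stored uncompressed.
function replayableHeaders(headers, skippedNames = ['content-encoding']) {
  const skipped = new Set(skippedNames.map((name) => name.toLowerCase()));
  return (headers || []).filter(
    (header) => !skipped.has(String(header.name).toLowerCase())
  );
}

// Example with headers shaped like the cached documents ({ name, value }).
const cached = [
  { name: 'Content-Type', value: 'text/html; charset=utf-8' },
  { name: 'Content-Encoding', value: 'gzip' },
];
const safe = replayableHeaders(cached);
console.log(safe.map((h) => h.name));
```

Matching on the name rather than the value also covers other encodings, and the comparison is case-insensitive because HTTP header names are.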
Hi. I have a problem
spiderable: phantomjs failed: { [Error: Command failed: /bin/sh -c phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false /bundle/bundle/programs/server/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js "https://taptospeak.com/route/path#!:"
]
killed: true,
code: null,
signal: 'SIGTERM',
cmd: '/bin/sh -c phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false /bundle/bundle/programs/server/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js "https://mywebsite.com/route/path#!:"' }
I have no idea how to find a solution.
I have a concern regarding the Meteor.isReadyForSpiderable
variable.
It seems to me that phantom_script.js,
which the package settings place under Meteor Assets (full path "#{Meteor.rootPath}/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js"),
doesn't even have access to such variables as Meteor
or Package.
That's why it can't "read" the Meteor.isReadyForSpiderable
variable.
Reasons why I think so:
Routes in our project that don't set Meteor.isReadyForSpiderable = true
at all are "spidered" correctly.
Pages that normally load in less than 300ms take longer than 4 seconds to be "spidered" with ?_escaped_fragment_=
If I change the totalIterations limit in the if(totalIterations > 200)
check in phantom_script.js
to 50 or 100, pages are spidered much more quickly, but in some cases Spiderable returns the page in a "loading" state, which tells us that not all subscriptions are ready. This happens even if the corresponding route sets Meteor.isReadyForSpiderable = true
after @ready.
See the attached screenshots with load times for the page with ?_escaped_fragment_= and without it.
Has anyone seen similar behaviour, or am I missing something?
Waiting for server response with ?_escaped_fragment_=
Total time WITH ?_escaped_fragment_=
Total time WITHOUT ?_escaped_fragment_=
Hi,
We use nginx as a load balancer and we also configured nginx to redirect http requests to https requests using a 301 redirect. As far as I know this is a pretty standard practice.
The problem is, when these redirects are set up, spiderable breaks.
When I try to debug using cURL I get the following messages on the server;
Spiderable successfully completed for url: [301] http://example.com/
No matter whether I request the https variant or the http variant (using the -L option to follow redirects), this is the message the server returns.
When I cURL the http variant I get the default HTML template with an empty body.
When I cURL the https variant I get a cut-off reply like this:
Hey,
I'm having an issue with your package on my local dev machine. I'm using Meteor 1.4 with React and getting 204 No content
after requesting http://localhost:3000/?_escaped_fragment_=. Spiderable debug doesn't show anything. Does this package still work with 1.4?
We should transfer ownership of this repository to @dr-dimitru
See https://help.github.com/articles/about-repository-transfers/
The requirements:
The considerations:
git remote
There may be some other considerations, but I believe the risk is quite low.
Hey @jazeee .
It seems that 301 handling works incorrectly because the 'Location'
value is not passed back to the client in the headers of the 301 redirect.
Or am I missing something?
To fix this, something like the following must be added to server.coffee
:
location = output.content.match /.*(location url="(.*)").*/mi
if location?[2]
  output.headers.push
    name: 'Location'
    value: location[2]
In this example, location url = <REDIRECT URL>
is added to the HTML contents by the 301 Redirect Template.
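The same idea can be sketched in JavaScript as a pure function. It assumes, as in the post above, that the 301 template writes a location url="…" marker into the rendered HTML and that headers are stored as { name, value } objects. The helper name withLocationHeader is hypothetical.

```javascript
// Hypothetical helper: copies the redirect target out of the rendered HTML
// into a Location header, mirroring the CoffeeScript idea above.
function withLocationHeader(output) {
  const match = /location url="([^"]*)"/i.exec(output.content || '');
  if (!match || !match[1]) {
    return output; // no marker found: leave the result untouched
  }
  // Drop any existing Location header so the result carries exactly one.
  const headers = (output.headers || []).filter(
    (h) => String(h.name).toLowerCase() !== 'location'
  );
  headers.push({ name: 'Location', value: match[1] });
  return Object.assign({}, output, { headers });
}
```

Using [^"]* instead of the greedy (.*) keeps the captured URL from running past the closing quote when more attributes follow on the same line.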
Hi!
I'm getting this error when I try to render this URL: https://www.escapistas.club/?_escaped_fragment_=
2016-09-28T16:57:08.269592+00:00 app[web.1]: spiderable: phantomjs failed: { [Error: Command failed: /bin/sh -c phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false /app/.meteor/heroku_build/app/programs/server/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js "https://www.escapistas.club/"
2016-09-28T16:57:08.269607+00:00 app[web.1]: ]
2016-09-28T16:57:08.269608+00:00 app[web.1]: killed: false,
2016-09-28T16:57:08.269609+00:00 app[web.1]: code: 255,
2016-09-28T16:57:08.269610+00:00 app[web.1]: signal: null,
2016-09-28T16:57:08.269612+00:00 app[web.1]: cmd: '/bin/sh -c phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false /app/.meteor/heroku_build/app/programs/server/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js "https://www.escapistas.club/"' }
PhantomJS is installed on my Heroku web dyno, is in the path and works locally.
Also, I get no response when I execute this command in my Heroku bash:
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false /app/.meteor/heroku_build/app/programs/server/assets/packages/jazeee_spiderable-longer-timeout/lib/phantom_script.js "https://www.escapistas.club/"
Any idea?
Thanks for the help
I think that Spiderable should redirect crawlers from http://exemple.com/?_escaped_fragment_= to http://exemple.com/ with a 302 redirect.
Why? Otherwise, Google detects duplicate content, which is not SEO friendly :(
I say this because Google has also indexed my pages with ?_escaped_fragment_=
PhantomJS is no longer supported and gives everyone nightmares.
It's possible to use Firefox or Chrome in headless mode instead.
I did a proof of concept: (meteor/meteor#8661)
What do you guys think?
Edit: I have now switched to another solution: I hosted an instance of https://github.com/bosondata/chrome-prerender/
and used this nginx config to prerender pages: https://gist.github.com/thoop/8165802
Hey guys,
I'm having some trouble with your spiderable package. I only get an empty body, but Spiderable logs that everything was successful:
Spiderable successfully completed for url: [200] http://www.myapp.com/about/
In the page source, I get the normal section with an empty body. I'm using Ubuntu 14.04 for deployment.
If I try it on my Win 10 dev machine, I get the same content for every single page. For example, "/about" shows the content of "/" when I try to call it with "?_escaped_fragment_=".
Hi,
I'm having some trouble setting Meteor.isReadyForSpiderable = true;
when subscriptions are defined in the template and not the route.
Paginated tables are good example of this case.
I tried waiting for subscriptionsReady()
in the Template.templateName.onCreated
callback, but got nothing conclusive.
Do you have any ideas?
Here's my code:
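Template.list.onCreated(function () {
  // subscriptionsReady() (plural) is the Blaze template-instance method; it only
  // tracks subscriptions made with this.subscribe() on this template instance.
  this.autorun(() => {
    if (this.subscriptionsReady()) {
      Meteor.isReadyForSpiderable = true;
      console.log('ready');
    } else {
      Meteor.isReadyForSpiderable = false;
      console.log('not ready yet');
    }
    console.log(Meteor.isReadyForSpiderable);
  });
});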
Thanks,
Hi, I am using the Spiderable.ignoredRoutes configuration directive, but Spiderable still tries to fetch the page.
While debugging the library, I think the problem is in
return next();
});
at the end of server.js
Thanks for help,
M
I found this package very interesting, as I can decide when to release the content for spiders, but it crashes right after the installation ends with an "unexpected token (1, 81)". Could someone help? I did not find any reference to this error anywhere, so I conclude that it is specific to my installation. I am using Meteor 1.4.2.3 along with Angular.
Thanks in advance
I highly suggest using GitHub's releases to track the package versioning and changes.
@dr-dimitru - These are my thoughts on the issue.
In my testing, the phantom script behaves differently depending on the server load and CPU.
In particular, I have found that the following area of code seems to respond differently depending on how long one waits before responding.
https://github.com/jazeee/jazeee-meteor-spiderable/blob/master/lib/phantom_script.js#L72
As a simple test, I replace that line with if(renderIterations < 30 ){
which is the equivalent of waiting 3 seconds before processing the response. When I do that, I get the correct 404 response. If I change it to 20, it responds with a 304 or occasionally 404 or 200.
If I do the same on a smaller Meteor project, I always see 404, so I conclude that it is dependent on project size, and probably CPU and system load.
One thing that we may be able to do is be more reliant on Meteor.isReadyForSpiderable
, which should only be set once the route has completed and all subscriptions are ready.
I will test this approach...
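To make the timing trade-off concrete, here is a rough sketch of a polling decision that combines a minimum wait with the readiness flag. It is not the actual phantom_script.js logic. The 100 ms interval is inferred from "30 iterations is 3 seconds", the upper limit of 200 comes from the totalIterations check quoted elsewhere in this thread, and the name nextAction is hypothetical.

```javascript
// Assumed polling interval: 30 iterations correspond to roughly 3 seconds.
const INTERVAL_MS = 100;

// Hypothetical decision function for one polling tick.
// Returns 'wait', 'render', or 'timeout'.
function nextAction(iteration, isReady, minIterations = 30, maxIterations = 200) {
  if (iteration > maxIterations) {
    return 'timeout'; // give up and emit whatever has been rendered so far
  }
  if (iteration < minIterations) {
    return 'wait'; // always wait the minimum time, even if the flag is set
  }
  return isReady ? 'render' : 'wait';
}

// A caller could run this every INTERVAL_MS with the current value of the
// page's readiness flag until the result is no longer 'wait'.
console.log(nextAction(10, true), nextAction(30, true), nextAction(201, false));
```

Relying on the flag after a short minimum wait, rather than on a fixed iteration count alone, would make the result less dependent on CPU and server load.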
Hey guys,
I'm getting the following error when I try to call my page with ?_escaped_fragment_=
Couldn't find a template named "layout" or "layout". Are you sure you defined it?
It works fine without the escaped fragment. Does anybody know what causes this error?
When I deploy my application and call it with the escaped fragment, I only get the default Iron:Router page:
iron:router
Organize your Meteor application.
After running curl http://localhost:3000/\?_escaped_fragment_\=
I get lots of select: Invalid argument
errors, and they don't stop.
I did these steps:
meteor add jazeee:spiderable-longer-timeout
Meteor.isReadyForSpiderable = true;
in the componentDidMount() of one of my React components
I use react-router,
so I can't find Router.onAfterAction.
I would like some suggestions or a sample.
Thanks a lot!
Hey,
could you elaborate on this sentence:
"Important You will need to set Meteor.isRouteComplete=true when your route is finished, in order to publish."
what exactly does that mean, and what would i need to do if I want to use your package?
cheers
Will there be an upgrade soon to support routers that use ES6?
I've been having issues with Iron Router showing the 'Splash Screen', and on investigation (thanks to @dr-dimitru for the help too), it turned out that the problem was that it didn't support ES6.
In common nginx or Meteor configurations, if I visit a route that represents 404, I should see a 404 response code when spidering. (Important for SEO).
We have some discussion in #14 but should separate this in a new ticket.
Since Meteor/IronRouter seems to always return a 200 code, (which may occur after redirects etc), we may have to set some factor like Spiderable.responseCode=404, prior to completing the route. In this scenario, we would have to do this before we indicate that the route and subscriptions are complete (using Meteor.isReadyForSpiderable = true)
Hey,
I've been testing the package and most of the time it works, but sometimes (and in my case this is important) it just doesn't wait 30 seconds. After 6-7 seconds it just loads the page, but it only shows the loader. Normally the page loads in a second or two, so I'm not sure how to handle this. Any ideas?
Have been using this package for a while. However, this morning, phantom hell broke loose.
spiderable: phantomjs failed: { [Error: Command failed: ] killed: true, code: null, signal: 'SIGTERM' }
Running using mupx deploy
command.
Still works fine locally.
Hello!
Any plans to support flowrouter and/or template-level subscriptions?
For some reason, this morning, after running for a week, Spiderable is failing.
I see:
Exception in callback of async function: SyntaxError: Unexpected end of input
at Object.parse (native)
at packages/jazeee:spiderable-longer-timeout/lib/server.coffee:169:22
at packages/jazeee:spiderable-longer-timeout/lib/server.coffee:33:2
at runWithEnvironment (packages/meteor/dynamics_nodejs.js:108:1)
Hi,
Nice package, it was the only one that worked for me. I've tried it with an app that has subdomains, and it returns the same page with ?_escaped_fragment_=
on all of the subdomains. I'm not sure, but it seems related to using Meteor.absoluteUrl()
instead of req.headers.host
. I've reviewed the SpiderableCacheCollection collection and it stores only the Meteor.absoluteUrl() value.
Would you consider making a change like this?
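A rough sketch of the suggested change: deriving the base URL from the incoming request's Host header and falling back to a configured URL. The helper name baseUrlForRequest and the use of the x-forwarded-proto header are assumptions for illustration, not the package's behaviour.

```javascript
// Hypothetical helper: builds the base URL for the crawler request from the
// Host header, so each subdomain is rendered (and cached) under its own URL.
function baseUrlForRequest(req, fallbackUrl) {
  const headers = (req && req.headers) || {};
  const host = headers.host;
  if (!host) {
    return fallbackUrl; // e.g. the value of Meteor.absoluteUrl()
  }
  // Assumption: a reverse proxy may report the original scheme in this header.
  const forwarded = headers['x-forwarded-proto'];
  const proto = forwarded ? String(forwarded).split(',')[0].trim() : 'http';
  return `${proto}://${host}/`;
}

console.log(baseUrlForRequest({ headers: { host: 'blog.example.com' } }, 'http://example.com/'));
```

Because the Host header is supplied by the client, a real implementation would probably want to validate it against the known subdomains before using it as a cache key.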
Currently, all pages return with status code of 200, even if the route does not exist.
Returning 404 for non-existent pages is important for SEO.
This is a bit tricky in Meteor.
Meteor and IronRouter don't really allow you to easily return 404's.
Even though IronRouter says it defaults to 404's, it doesn't seem to do that. Instead it renders a default template. I can't figure out how to make it 404. There are a number of discussions online...
What I was able to do is to specify a default route using
Router.configure({notFoundTemplate: 'pageNotFound'})
And for the pageNotFound template, I can add:
Template.pageNotFound.rendered = ->
  Meteor.isPageNotFound = true
  Meteor.errorResponseCode = 404
Then, I can use that in phantomJS. I have that working, which was a bit of a pain dealing with PhantomJS specifics...
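One way the flags above could be turned into a response status is sketched below. This is an illustration under assumptions, not the package's implementation: it assumes the PhantomJS side can read the two flags from the page and hand them to a function such as the hypothetical resolveStatusCode.

```javascript
// Hypothetical helper: picks the HTTP status to return to the crawler from
// flags set by the rendered page (see the pageNotFound template above).
function resolveStatusCode(pageFlags, upstreamStatus = 200) {
  const code = Number(pageFlags && pageFlags.errorResponseCode);
  // Only accept explicit error codes in the 4xx/5xx range.
  if (Number.isInteger(code) && code >= 400 && code <= 599) {
    return code;
  }
  if (pageFlags && pageFlags.isPageNotFound) {
    return 404;
  }
  return upstreamStatus; // otherwise keep whatever the server reported
}

console.log(resolveStatusCode({ isPageNotFound: true, errorResponseCode: 404 }));
```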
Hi there. I have this error in logs instead of phantomjs error message.
Exception in callback of async function: SyntaxError: Unexpected token ^~H
at Object.parse (native)
at packages/jazeee:spiderable-longer-timeout/lib/server.coffee:169:22
at packages/jazeee:spiderable-longer-timeout/lib/server.coffee:33:2
at runWithEnvironment (packages/meteor/dynamics_nodejs.js:108:1)
My phantomjs version is
~$ phantomjs --version
1.9.0
Hi, thanks for this package.
I just tried it and it's showing me
Iron:Router's message on my homepage.
In the debug console, I get this:
Spiderable successfully completed [from cache] for url: [200] http://localhost:3000/
meaning it was successful.
How can I solve this?
Thanks