seimiagent's Introduction

SeimiAgent

A headless, standalone WebKit server that makes grabbing dynamic web pages easier.

Chinese documentation

Download

Quick Start

cd /dir/of/seimiAgent
./seimiagent -p 8000

SeimiAgent will start and listen on the port you set. You can then use any HTTP client to POST a load request to SeimiAgent and get back the content as Chrome would render it. Suitable HTTP clients include, but are not limited to, Apache HttpClient (Java), curl (command line), and httplib2 (Python).
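
As a rough illustration of such a load request, here is a minimal Python sketch using only the standard library. It assumes the agent was started locally with `./seimiagent -p 8000` and that the parameters are sent as a form-encoded POST body to the /doload path; the function names are hypothetical and not part of SeimiAgent.

```python
import urllib.parse
import urllib.request

# Assumption: SeimiAgent is running locally, started with `./seimiagent -p 8000`.
AGENT_BASE = "http://localhost:8000"


def build_doload_request(target_url, render_time_ms=None, agent_base=AGENT_BASE):
    """Build (but do not send) a POST request for SeimiAgent's /doload path."""
    fields = {"url": target_url}
    if render_time_ms is not None:
        fields["renderTime"] = str(render_time_ms)
    # Assumption: the parameters travel as a form-encoded request body.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(agent_base + "/doload", data=body, method="POST")


def fetch_rendered_html(target_url, render_time_ms=None, timeout_s=60):
    """Send the request and return the rendered page as text (needs a running agent)."""
    request = build_doload_request(target_url, render_time_ms)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")
```

With an agent running, `fetch_rendered_html("http://example.com", render_time_ms=2000)` would return the rendered HTML. The client-side timeout should be comfortably longer than the render time you request.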

Demonstrations

  • basic

demo

If the demo fails to load on GitHub, you can view it here.

  • Significantly simplify logging in to a complex system by using JavaScript

You can view the video on the blog.

HTTP parameters that SeimiAgent supports

Only POST is supported. Request path: /doload

  • url The target URL.

  • renderTime How long, in milliseconds, SeimiAgent should keep processing JavaScript and the document after loading has finished.

  • proxy Tells SeimiAgent to use a proxy. Pattern: http|https|socket://user:passwd@host:port

  • postParam A JSON string only. Tells SeimiAgent to request the target with the HTTP POST method and pass the parameters given in postParam.

  • useCookie If useCookie==1, SeimiAgent uses cookies. Default: 0.

  • contentType Determines the output format. You can choose img or pdf; the default is html.

  • script A JavaScript snippet that operates on the current HTML document, executed as if it were typed into the Chrome console.

  • ua Sets the User-Agent.

  • resourceTimeout Sets the timeout for resource requests, such as JavaScript files. Default: 20000 ms.
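
The parameters above could be assembled into a request body roughly as follows. This is a sketch under assumptions rather than an official client: it assumes a form-encoded body, and the function name and keyword arguments are hypothetical. Only the field names (url, renderTime, proxy, postParam, useCookie, contentType, script, ua, resourceTimeout) come from the list above.

```python
import json
import urllib.parse


def encode_doload_body(url, render_time_ms=None, proxy=None, post_param=None,
                       use_cookie=False, content_type=None, script=None,
                       ua=None, resource_timeout_ms=None):
    """Return a form-encoded /doload body containing only the options that were set."""
    fields = {"url": url}
    if render_time_ms is not None:
        fields["renderTime"] = str(render_time_ms)
    if proxy:
        fields["proxy"] = proxy
    if post_param is not None:
        # postParam must be a JSON string, so the dict is serialized here.
        fields["postParam"] = json.dumps(post_param)
    if use_cookie:
        fields["useCookie"] = "1"
    if content_type is not None:
        # Leave content_type as None to get the default html output.
        if content_type not in ("img", "pdf"):
            raise ValueError("contentType must be 'img' or 'pdf'")
        fields["contentType"] = content_type
    if script:
        fields["script"] = script
    if ua:
        fields["ua"] = ua
    if resource_timeout_ms is not None:
        fields["resourceTimeout"] = str(resource_timeout_ms)
    return urllib.parse.urlencode(fields).encode("utf-8")
```

For example, `encode_doload_body("http://example.com", render_time_ms=1500, content_type="img")` produces a body asking for a screenshot after 1.5 seconds of rendering. The response bytes would then be written to a file rather than decoded as text.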

How to build

The build takes a very long time, so using the prebuilt binary from 'Download' is recommended.

Requirements

  • on ubuntu
sudo apt-get install build-essential g++ flex bison gperf ruby perl libsqlite3-dev libfontconfig1-dev libicu-dev libfreetype6 libssl-dev libpng-dev libjpeg-dev python libx11-dev libxext-dev
  • on centos
yum -y install gcc gcc-c++ make flex bison gperf ruby openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel libpng-devel libjpeg-devel

Build

python build.py

Then wait, or have a cup of tea.

More

More documentation is on its way...

seimiagent's People

Contributors

zhegexiaohuozi

seimiagent's Issues

SeimiAgent has crashed

This is the full error output:

HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.

Details

Base environment: Aliyun, 1 core, 1 GB memory, CentOS 7
Runtime environment: Docker version 1.9.1
Start command: nohup ./seimiagent -p 12345 &
After about 50 requests, it crashed.

SeimiAgent has crashed

SeimiAgent crashed while downloading https://web2.cylex.de/firma-home/euronics-elektrohaus-klaes-7531765.htm
1 0x1a76307 /opt/webkit/webkitagent() [0x1a76307]
2 0x1abbac5 /opt/webkit/webkitagent() [0x1abbac5]
3 0x1aa347d /opt/webkit/webkitagent() [0x1aa347d]
4 0x178583d /opt/webkit/webkitagent() [0x178583d]
5 0x1a4a5ba /opt/webkit/webkitagent() [0x1a4a5ba]
6 0x7fd98c8020e5 [0x7fd98c8020e5]
SeimiAgent has crashed.
You can go to https://github.com/zhegexiaohuozi/SeimiAgent/issues and report a bug.

SeimiAgent has crashed.

[Seimi] All load finished.
[Seimi] Document render out over.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
SeimiAgent has crashed.
You can go to https://github.com/zhegexiaohuozi/SeimiAgent/issues and report a bug.
Such as your os version,physical memory size,app version,current url etc.
Segmentation fault

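Is @font-face supported?

@font-face is only supported from CSS3 onward, and in my tests it does not seem to be supported. Do I need to build a newer version of QtWebkit myself?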

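Setting the render time above 20000 ms kills the program (seimiagent_linux_v1.3.1_x86_64.tar.gz)

With the render time set above 20000 ms, the program repeats the request, and finally:
[Seimi] Document render out over.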
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
SeimiAgent has crashed.
You can go to https://github.com/zhegexiaohuozi/SeimiAgent/issues and report a bug.
Such as your os version,physical memory size,app version,current url etc.

The delay setting has no effect, causing SeimiAgent to crash

[seimi] TargetUrl[http://www.tianyancha.com/search/深圳市城市建设开发(集团)公司?type=company] process:100%
[Seimi] All load finished.
[seimi] TargetUrl[http://www.tianyancha.com/search/深圳市城市建设开发(集团)公司?type=company] STATS [10 requests total] [30.00% from cache] [0.00% pipelined] [0.00% SSL/TLS] [30.00% Zerocopy]
[seimi] TargetUrl[http://www.tianyancha.com/search/深圳市万科房地产有限公司?type=company] STATS [10 requests total] [10.00% from cache] [0.00% pipelined] [0.00% SSL/TLS] [40.00% Zerocopy]
[Seimi] Document render out over.
127.0.0.1 - - [05/07/2016 10:07:48] "POST /doload HTTP/1.1" 200 88849 5.815
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
SeimiAgent has crashed.
You can go to https://github.com/zhegexiaohuozi/SeimiAgent/issues and report a bug.
Segmentation fault (core dumped)

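Symptom: in start(Request req) I push many requests in a loop, with the SeimiAgent render time set to 1 second, the Crawler class's delay set to 20 seconds, and the default request queue. At run time the first request is delayed by 20 seconds, but the later ones are not: the SeimiAgent console prints all the links being requested at once, and then it crashes.
Question: how can I make every request wait for the delay after the previous one completes? When SeimiAgent visits too quickly, the target site asks for a CAPTCHA. Thanks!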
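
One generic way to space requests out is to throttle on the client side, before each POST to SeimiAgent. The sketch below is a hypothetical Python illustration of that idea, not the Crawler class's delay mechanism and not part of SeimiAgent; the class name and its clock and sleep parameters are assumptions introduced so the logic can be tested.

```python
import time


class MinIntervalThrottle:
    """Block until at least `interval_s` seconds have passed since the previous call."""

    def __init__(self, interval_s, clock=time.monotonic, sleep=time.sleep):
        self.interval_s = interval_s
        self._clock = clock      # injectable so the timing logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous wait(), None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval_s - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` on a single shared `MinIntervalThrottle(20)` before each request would keep consecutive requests at least 20 seconds apart, provided the requests are issued from one thread.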

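SeimiAgent cannot be built

Environment: Ubuntu
Preparation: downloaded the specified qtbase and qtwebkit into seimiagent's src/qt directory
Action: ran the build.py script
Result: it reports missing .h files under the QtCore directory
Attempt 1: found the corresponding .h file and added it to the include/QtCore directory; the build then reported the next .h file as missing. Failed.
Attempt 2: downloaded and installed Qt, then copied its QtCore directory into seimiagent's qtbase/include directory. Failed; the source code itself reported errors.
Expected result: a description of which environments are required and how each one needs to be configured. As a Java developer, I hope to follow the steps, get the build working, and then only tweak the C code slightly; if the build does not work, there is no way to have fun with it. I hope to get your help. Thanks.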

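When will there be API documentation?

When will API documentation be written? I am currently using PhantomJS; if similar documentation existed, I would try swapping the underlying engine for SeimiAgent. That one is updated very slowly. I built a web page screenshot service with Node.js that returns images in binary and base64 formats. Some of its interfaces are not implemented, so I have to work around them.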

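Crashes very easily at run time; setting the render time is problematic

It crashes very easily at run time, or crashes after running for a long time.

The documentation says that complex pages need a render time to be set, but I found that before the configured render time is reached, the connection fetching the page already reports a timeout. I later found that the socket used to connect to the page also has a timeout, and that this socket timeout is at most the default of 1500 milliseconds.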

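To the author: QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe.

I searched on Baidu, and this seems to be caused by PhantomJS using too many threads without reclaiming them in time (a stack overflow that needs manual release?). I have no experience with this and do not know how to solve it.
Here is what I found:
https://github.com/ariya/phantomjs/issues/14180 (this says Qt 5.7 will fix the problem, so maybe an update is needed?)
https://stackoverflow.com/questions/10013094/does-the-qt-event-listener-occupy-a-file-handle
https://stackoverflow.com/questions/15005830/phantomjs-using-too-many-threads
I hope the author can take a look at this issue. Thanks.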

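Rendering error

Operating system and version: Linux version 2.6.32-573.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Jul 23 15:44:03 UTC 2015
SeimiAgent version: SeimiAgent,a headless,standalone webkit server which make grabing dynamic web page easier. 1.3.0
URL with the problem: http://www.baidu.com
Description: The first time, with no time limit set, the server side reported an error after rendering. After adding a time limit there was no problem, and after removing the time limit again there was still no problem.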

Can't run on CentOS

./seimiagent -p 9527

./seimiagent: error while loading shared libraries: libwebp.so.5: cannot open shared object file: No such file or directory

I downloaded libwebp from http://www.linuxfromscratch.org/blfs/view/svn/general/libwebp.html and ran make && make install, but it still fails to run with the same error.

lsb_release -a

LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 5.8 (Final)
Release: 5.8
Codename: Final

How is the script parameter used?

Where is the result of the script parameter returned?
I see evaluateJavaScript done. script = "..." returned, but I do not see the execution result.

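AJAX pagination on list pages

That is, once the page has been loaded the first time, subsequent operations should only need to send JavaScript code to act on it and get the resulting page, rather than having to visit the page again every time before operating on it.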

Bug report: an internal error occurs when many requests are pushed

java.lang.NullPointerException: null
at java.security.MessageDigest.update(MessageDigest.java:323) ~[na:1.7.0_79]
at java.security.MessageDigest.digest(MessageDigest.java:398) ~[na:1.7.0_79]
at org.apache.commons.codec.digest.DigestUtils.md5(DigestUtils.java:165) ~[commons-codec-1.6.jar:1.6]
at org.apache.commons.codec.digest.DigestUtils.md5(DigestUtils.java:190) ~[commons-codec-1.6.jar:1.6]
at org.apache.commons.codec.digest.DigestUtils.md5Hex(DigestUtils.java:226) ~[commons-codec-1.6.jar:1.6]
at cn.wanghaomiao.seimi.def.DefaultLocalQueue.isProcessed(DefaultLocalQueue.java:71) ~[classes/:na]
at cn.wanghaomiao.seimi.core.SeimiProcessor.run(SeimiProcessor.java:92) ~[classes/:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]

I hope BaseSeimiCrawler can support setting the number of threads. Thanks for taking a look.

SeimiAgent has crashed

Download url:https://www.google.com/search?safe=strict&site=&source=hp&q="FACC" ag Competive Advantage

...skipping...
HttpConnection::writeResponse called while state is not 'SendingHeaders', not proceeding with sending headers.
SeimiAgent has crashed.
You can go to https://github.com/zhegexiaohuozi/SeimiAgent/issues and report a bug.
Such as your os version,physical memory size,app version,current url etc.

Details:

Base environment: CentOS 7.1, 2 cores, 8 GB memory
Start command: nohup ./seimiagent -p 8080 &
After about 50 requests, it crashed.

Read timed out

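I have already set a 120-second timeout. SeimiAgent appears to have finished reading the page, yet I only get a response once the timeout is reached. Any help is appreciated.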

An error occurs whenever a UA is set; please advise

10:01:54 ERROR c.w.seimi.core.SeimiProcessor - timeout
java.net.SocketTimeoutException: timeout
at okio.Okio$3.newTimeoutException(Okio.java:212) ~[okio-1.8.0.jar:na]
at okio.AsyncTimeout.exit(AsyncTimeout.java:288) ~[okio-1.8.0.jar:na]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:242) ~[okio-1.8.0.jar:na]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:325) ~[okio-1.8.0.jar:na]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:314) ~[okio-1.8.0.jar:na]
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210) ~[okio-1.8.0.jar:na]
at okhttp3.internal.http.Http1xStream.readResponse(Http1xStream.java:184) ~[okhttp-3.3.1.jar:na]
at okhttp3.internal.http.Http1xStream.readResponseHeaders(Http1xStream.java:125) ~[okhttp-3.3.1.jar:na]
at okhttp3.internal.http.HttpEngine.readNetworkResponse(HttpEngine.java:775) ~[okhttp-3.3.1.jar:na]
at okhttp3.internal.http.HttpEngine.access$200(HttpEngine.java:86) ~[okhttp-3.3.1.jar:na]
at okhttp3.internal.http.HttpEngine$NetworkInterceptorChain.proceed(HttpEngine.java:760) ~[okhttp-3.3.1.jar:na]
at okhttp3.internal.http.HttpEngine.readResponse(HttpEngine.java:613) ~[okhttp-3.3.1.jar:na]
at okhttp3.RealCall.getResponse(RealCall.java:244) ~[okhttp-3.3.1.jar:na]
at okhttp3.RealCall$ApplicationInterceptorChain.proceed(RealCall.java:201) ~[okhttp-3.3.1.jar:na]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:163) ~[okhttp-3.3.1.jar:na]
at okhttp3.RealCall.execute(RealCall.java:57) ~[okhttp-3.3.1.jar:na]
at cn.wanghaomiao.seimi.http.okhttp.OkHttpDownloader.process(OkHttpDownloader.java:74) ~[classes/:na]
at cn.wanghaomiao.seimi.core.SeimiProcessor.run(SeimiProcessor.java:106) ~[classes/:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190) ~[na:1.7.0_79]
at java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_79]
at okio.Okio$2.read(Okio.java:140) ~[okio-1.8.0.jar:na]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:238) ~[okio-1.8.0.jar:na]
... 18 common frames omitted
10:01:54 INFO c.w.seimi.core.SeimiProcessor - Request process error,req will go into queue again,url=http://www.bjstats.gov.cn/tjsj/yjdsj/GDP/2015,maxReqCount=3,currentReqCount=1

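Cookies not working

In v1.3.1, after I log in, pages that require login information still cannot be accessed.
I followed the JD.com login shown in the video and tried several other logins, and none of them work.
The login itself succeeds, and I also added the parameter useCookie=1.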
