Code Monkey home page Code Monkey logo

salyut's Introduction

中文 | EN

Salyut

基于标记语言的开源爬虫框架。查看Salyut语法。

GitHub license MVN version Build Status

Salyuttrico script的的解析执行引擎,通过简单的调用Salyut中方法,即可以运行trico script并得到相应的结果。Salyut是一个开源项目,您可以自行修改expr目录下的类定制自己语法表达式,也可以通过修改或增加token目录下的类来扩充Salyut的能力。

Salyut基于的技术

  • 通过Yaml来对词法进行解析,如果您对Yaml有一定的了解,可以更好的帮助您提升trico script的一些语言特性。

  • 主要通过Selenium来获取浏览器的操控和解析能力,如果您对Selenium有一定的了解可以更好的帮助您提升trico script的一些能力。

脚本示例

- segment: #This is a trico script sample. A web calculator.
    name: '"calc"'
    args: {0: '/input'}
    body:
    - load: '"http://www.baidu.com"' #load page of Baidu
    - wait: {ele: '#su', type: '"presence"'} #wait element appear
    - fill: {ele: '#kw', value: '$/input'}
    - click: {ele: '#su'} #click search button
    - wait: {ele: '.op_new_val_screen_result', type: '"presence"'} #fetch the result
    - select: {ele: '.op_new_val_screen_result', path: '/result'}
    - put: { path: '/resultInt', value: '$/result', lambda: 'x -> return parseInt(x)'}
    - return: '$/resultInt'
- callin: {seg: '"calc"', 0: '"1+1"'} #call the segment named 'calc' and pass the argument '1+1'
- return: '$1'

如何使用

  • jar包调用

    1.在工程目录使用mvn clean package 打包

    2.cd target

    3.java -Dscript.path=../sample/webCalc.tr -Ddriver.path=../env/geckodriver-macOS -classpath salyut-jar-with-dependencies.jar com.trico.salyut.Salyut

    4.参数说明

    • script.path - 要执行脚本的路径 样例在sample中可以找到。
    • driver.path - driver路径 在env中可以找到。
    • segment.path - 预加载segment文件的路径。
    • headless - 是否无框启动浏览器。
    • browser.count - 可启动的浏览个数。
    • newTab.path - Salyut引擎创建新tab所需 在env中可以找到,需要全路径。
  • 代码调用

Salyut.setEnv(EnvKey.DRIVER_PATH,{your driver path});
Salyut.setEnv(EnvKey.NEW_TAB_PATH,{your newTab file path});

Salyut.launch();
Salyut.setOutputListener(
        msg -> {
            //引擎处理过程中产生的消息
        }
);
Salyut.setResultListener(
        result -> {
            //返回结果
        }
);
String script = {your script content};
TricoScript script = new TricoScript(yaml);
Salyut.execScript(script.getContent(),"","");

Maven

<dependency>
    <groupId>com.trico.salyut</groupId>
    <artifactId>salyut</artifactId>
    <version>0.0.8-SNAPSHOT</version>
</dependency>

架构

许可证

Salyut 使用 Apache 2.0 许可证,您可以免费下载,修改以及部署源代码。您还可以将Salyut Engine进行服务化封装,从而让Salyut Engine具有更强大的业务处理能力。

Salyut 还添加了Commons Clause 1.0条款,对于您通过Salyut Engine进行服务化封装后进行商业售卖做了限制。当然我们还是希望Salyut Engine能够更好的作出开源贡献,所以如果您对如何商用Salyut Engine有更好的想法,请随时与我们联系[email protected]

商业应用

目前Salyut已经应用于Trico Cloud商业平台,Trico Cloud致力于通过提供爬虫云原生服务,解决高可用,低成本数据抓取,并为爬虫开发者与使用者提供更好的爬虫生态。

联系方式

salyut's People

Contributors

shenruisi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

salyut's Issues

callin标签中key与value之间没有空格,报空指针,kernel没有抛出异常

callin标签中key与value之间没有空格,报空指针,kernel没有抛出异常,导致IDE一直处于running状态。卡住的假象。
示例代码(0:'"test"'之前没有空格)

- segment:
    # 特别注意testSegment的入口名必须是以“__test__”开头,(您也可以写多个Segment通过其中callin来进行调用)
    name: '"__test__appEdu"'
    package: '"school"'
    args: {}
    body:
        - callin: {seg: '"appEdu"', package: '"school"', 0:'"test"', 1: '"https://www.app.edu.gov.on.ca/eng/sift/schoolProfile.asp?SCH_NUMBER=000779"', 2: '"000779"', 3: '"Elementary"' }
        - put: {path: '/test', value: '$1'}
        - echo: '$/test'
        - if: '$/test != $null'
        - then: 
            - return: '$true'
        - else:
            - return: '$false'

kernel报错信息

1589967087774	mozrunner::runner	INFO	Running command: "/Applications/Firefox.app/Contents/MacOS/firefox-bin" "-marionette" "-foreground" "-no-remote" "-profile" "/var/folders/43/6g6jmkbj2qg6zy1cn_pqqmww0000gp/T/rust_mozprofileLq211j"
1589967088115	[email protected]	WARN	Loading extension '[email protected]': Reading manifest: Invalid extension permission: networkStatus
1589967088325	[email protected]	WARN	Loading extension '[email protected]': Reading manifest: Invalid extension permission: mozillaAddons
1589967088325	[email protected]	WARN	Loading extension '[email protected]': Reading manifest: Invalid extension permission: telemetry
1589967088325	[email protected]	WARN	Loading extension '[email protected]': Reading manifest: Invalid extension permission: resource://pdf.js/
1589967088325	[email protected]	WARN	Loading extension '[email protected]': Reading manifest: Invalid extension permission: about:reader*
1589967090030	Marionette	INFO	Listening on port 61627
1589967090137	Marionette	WARN	TLS certificate errors will be ignored for this session
五月 20, 2020 5:31:30 下午 org.openqa.selenium.remote.ProtocolHandshake createSession
信息: Detected dialect: W3C
2020-05-20 17:31:31,618 [release] - java.lang.NullPointerException

switchtab isn't work

when I user switchtab in my segment, I find that switchtab isn't work. I can't understard "index" .
This is my segment . please tell me how to find my tab by index.

- segment:
   name: '"calc"'
   args: {0: '/input'}
   body:
       #load page of Baidu
       - load: '"https://my.xiapibuy.com/Cushions-Covers-cat.26.681.9888?page=0&sortBy=sales"'
    
       - loop:
           in: {eles: 'div.shopee-search-item-result__items div.shopee-search-item-result__item'}
           each:
             - select: {path: '/tmp/url',under: '$e',ele: 'a[data-sqe=link]',attr: '"href"'}
             - put: {path: '/urls',value: '"window.open(\"" + $/tmp/url + "\")"'}
             - put: {path: '/tmp/count',value: '$i'}
             - echo: '$/urls'
             - js: '$/urls'
             - wait: {type: '"time"',millis: '4000'}
             - switchtemptab: {index: '1'}
             - wait: {type: '"presence"',ele: 'button.btn-solid-primary',timeout: '10000'}
             - find: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(1) > div'}
             - if: '$1'
             - then:
                 - select: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(1) > div',path: '/tmp/rank'}
                 - contains: {path: '/tmp/rank',search: '"RM"'}
                 - if: '$1'
                 - then:
                     - put: {path: '/tmp/rank',value: '"无评分"'}
             - else:
                 - put: {path: '/tmp/rank',value: '"无评分"'}
             - find: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(3) > div:nth-child(1)'}
             - if: '$1'
             - then:
                 - select: {path: '/tmp/sale',ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(3) > div:nth-child(1)'}
             - else:
                 - put: {path: '/tmp/sale',value: '"无"'}
             - find: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(2) > div:nth-child(1)'}
             - if: '$1'
             - then:
                 - select: {path: '/tmp/rant',ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card> div.flex.flex-auto > div > div > div:nth-child(2) > div:nth-child(1)'}
             - else:
                 - put: {path: '/tmp/rant',value: '"无"'}
             - find: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card > div.flex.flex-auto > div > div:nth-child(3) > div > div > div > div > div.flex.items-center > div:nth-child(1)'}
             - if: '$1'
             - then:
                 - select: {path: '/tmp/price',ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card > div.flex.flex-auto > div > div:nth-child(3) > div > div > div > div > div.flex.items-center > div:nth-child(1)'}
             - else: 
                 - put: {path: '/tmp/price',value: '"无售价"'}
             - find: {ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card > div.flex.flex-auto > div > div > span'}
             - if: '$1'
             - then:
                - select: {path: '/tmp/title',ele: '#main > div > div.shopee-page-wrapper > div.page-product > div.container > div.product-briefing.flex.card > div.flex.flex-auto > div > div > span'}
             - else:
                 - put: {path: '/tmp/title',value: '"未获取到"'}
             - put: {path: '/R/list[-1]',value: '$/tmp'}
             - remove: {path: '/tmp'}
             - closetemptab:
       - export: {type: '"excel"',path: '/R/list/',value: 'count,title,price,sale,rank,rant,url'}
       - return: '$/R'
#call the segment named 'calc' and pass the argument '1+1'
- callin: {seg: '"calc"', 0: '1+1'}
- echo: '$1'   

Segment解析后格式化问题

name前面有6个空格占位,formatYaml此方法中, builder.append(lines[i]); 去掉lines[i]原有空格

private static String formatYaml(String yamlString){
        StringBuilder builder = new StringBuilder();
        String[] lines = yamlString.split("\n");
        builder.append(lines[0]);
        builder.append("\n");
        for(int i = 1; i < lines.length; i++){
            builder.append("    ");
            builder.append(lines[i]);
            builder.append("\n");
        }
        return builder.toString();
    }

STANDARD_IDENT 应该为4个空格占位,以下代码中只有3个占位。

private static final String STANDARD_IDENT = " ";

最后结果

- segment: 
      name: '"__test__baidunews"'
      args: 
      package: '"tom"'
      body: 
         - load: '"http://news.baidu.com/"'
         - select: { eles: '.ulist li',path: '/arr' }
         - len: '$/arr'
         - if: '$1 > 0'
         - then: 
            - return: '$true'
         - else: 
            - return: '$false'

private static final String STANDARD_IDENT = " ";

没有String转数字token

目前使用lambda转,可以提供字符串和数字互转的token

- put: {path: '/number' , value: '$/str', lambda: 'x -> return Number(x)+0.01;'}

js支持返回值

建议js支持执行完通过return返回一个String类型的值 js改成类似 - js: { path:'/res', str: '"var a=1; return a;"' } 的写法 将a的值保存在变量path内

loop内使用if then continue死循环

# 以下是一个打开页面的segment, 请注意以下注释的注意点
- segment:
    name: '"new_animation_update_time"'
    package: '"Kiva"'
    args: {0: '/url'}
    body:
        - load: '$/url'
        - loop:
            in: { eles: '.tl-day' }
            each:
                - select: { ele: '$e', attr: '"class"', path: '/clazz' }
                - if: '$/clazz == "tl-day today"'
                - then:
                    - put: { path: '/todayIndex', value: '$i', lambda: 'x -> return Number(x)' }
                    - break:
        - echo: '"todayIndex: " + $/todayIndex'        
        - loop:
            in: { eles: '.season-timeline' }
            each:
                - echo: '"i:" + $i'
                - echo: '$/todayIndex'
                - if: '$i < $/todayIndex'
                - then:
                    - continue:
                - put: { path: '/index', value: '$i'}
                - echo: '"else:" + $/index'
                - select: { eles: '.season-group:not(.season-timer)', under: '$e', path: '/groupList' }
#                 - loop:
#                     in: { values: '$/groupList', mutable: '$false' }
#                     each:
#                         - select: { ele: '.group-time', under: '$e', path: '/time' }
#                         - select: { eles: '.season-item', under: '$e', path: '/items' }
#                         - put: { path: '/head', value: '$/headList[$/index]'}
#                         - select: { ele: '.t-date', under: '$/head', path: '/date' }
#                         - select: { ele: '.t-week', under: '$/head', path: '/week' }
#                         - echo: '$/date'
#                         - loop:
#                             in: { values: '$/items', mutable: '$false' }
#                             each:
#                                 - select: { ele: 'img', attr: 'src', under: '$v', path: '/item/pic' }
#                                 - select: { ele: '.season-title', under: '$v', path: '/item/title' }
#                                 - select: { ele: '.season-desc', under: '$v', path: '/item/desc' }
#                                 - put: { path: '/item/date', value: '$/date' }
#                                 - put: { path: '/item/week', value: '$/week'}
#                                 - put: { path: '/resList[-1]', value: '$/item' }
        - return: '$/resList'

- segment:
    name: '"__test__new_animation_update_time"'
    package: '"Kiva"'
    body:
        - callin: {seg: '"new_animation_update_time"', 0: '"https://www.bilibili.com/anime/timeline"', package: '"Kiva"'}
        - echo: '$1'
        
# 在trico后台要提交segment的时候,请将以下这一行的callin注释掉。
- callin: {seg: '"__test__new_animation_update_time"', package: '"Kiva"'}

loop 的 start属性不生效

下面那个loop用了start参数 按照输出$/startIndex为6 但是还是从0开始 之后超出数组界限报错

# 以下是一个打开页面的segment, 请注意以下注释的注意点
- segment:
    name: '"new_animation_update_time"'
    package: '"Kiva"'
    args: {0: '/url'}
    body:
        - load: '$/url'
        - loop:
            in: { eles: '.tl-day' }
            each:
                - select: { ele: '.t-date', parent: '$e', path: '/date' }
                - select: { ele: '.t-week', parent: '$e', path: '/week' }
                - put: { path: '/time/date', value: '$/date' }
                - put: { path: '/time/week', value: '$/week' }
                - put: { path: '/list[$i]', value: '$/time' }
                - remove: { path: '/time' }
                - select: { target: '$e', attr: '"class"', path: '/tlDayClazz' }
                - if: '$/tlDayClazz == "tl-day today"'
                - then:
                    - put: { path: '/startIndex', value: '$i' }
        - loop:
            in: { eles: '.tl-day' }
            each:
                - if: '$i >= $/startIndex'
                - then:
                    - put: { path: '/res[-1]', value: '$/list[$i]' }
        - echo: '$/res'
        - echo: '$/startIndex'
        - loop:
            in: { eles: '.season-timeline', start: '$/startIndex' }
            each:
                - echo: '$i'
                - select: { ele: '$e', path: '/now' }
                - echo: '$/now'
                - remove: { path: '/now' }
                - loop:
                    in: { eles: '.season-group:not(.season-timer)', parent: '$e' }
                    each:
                        - loop:
                            in: { eles: '.season-item', parent: '$e' }
                            each:
                                - select: { ele: 'img', attr: '"src"', parent: '$e', path: '/item/imgSrc' }
                                - select: { ele: '.season-title', parent: '$e', path: '/item/title' }
                                - select: { ele: '.season-desc', parent: '$e', path: '/item/desc' }
                                - put: { path: '/animationList[-1]', value: '$/item' }
                                - remove: { path: '/item' }
                - put: { path: '/now', value: '$/res[$i]' }
                - put: { path: '/now/animationList', value: '$/animationList' }
                - remove: { path: '/animationList' }
                - remove: { path: '/now' }
               
                    
- segment:
    name: '"__test__new_animation_update_time"'
    package: '"Kiva"'
    body:
        - callin: {seg: '"new_animation_update_time"', 0: '"https://www.bilibili.com/anime/timeline"', package: '"Kiva"'}
        - echo: '$1'
        
# 在trico后台要提交segment的时候,请将以下这一行的callin注释掉。
- callin: {seg: '"__test__new_animation_update_time"', package: '"Kiva"'}

Segment中args为空时(即"args: {}"),经过parser()之后,变成"args:",丢失了"{}"

原Segment

- segment: 
      name: '"__test__baidunews"'
      args:  {}
      package: '"tom"'
      body: 
         - load: '"http://news.baidu.com/"'

经过parser() 方法调用后

- segment: 
      name: '"__test__baidunews"'
      args: 
      package: '"tom"'
      body: 
         - load: '"http://news.baidu.com/"'

可能问题:args当做属性处理

private static final String ATTRIBUTE_FORMAT = "%s: ";

 else if (KeyType.ATTRIBUTE.equals(keyType)){
     builder.append(String.format(ATTRIBUTE_FORMAT,tuple2.f0));
 }

readme中example代码是否有问题

Salyut.setEnv(EnvKey.DRIVER_PATH,{your driver path});
Salyut.setEnv(EnvKey.NEW_TAB_PATH,{your newTab file path});

Salyut.launch();
Salyut.setOutputListener(
        msg -> {
            //引擎处理过程中产生的消息
        }
);
Salyut.setResultListener(
        result -> {
            //返回结果
        }
);
String script = {your script content};
TricoScript script = new TricoScript(yaml);
Salyut.execScript(script.getContent(),"","");

String script = {your script content};
TricoScript script = new TricoScript(yaml);
这里是否笔误?

遍历调用带有参数的Segment,均执行第一次调用的Segment的参数

segment 参数一直保留着第一次传进去的值

部分示例如下:

- segment:
    package: '"school"'
    name: '"getReportCardData"'
    args: {}
    body:
        - loop:
            in: {eles: '.row table#insidetable2', mutable: '$false'}
            each:
                - select: { ele: 'caption', under: '$e', path: '/indexType'}
                - echo: '$/indexType'
                - put: { path: '/index', value: '$i'}
                - js: '""'
                - callin: {package: '"school"', seg: '"getTableData"', 0: '$/indexType', 1: '$/index'}
                - put: {path: '/cardTableList', value: '$1'}
                - if: '$/cardTableList != $null'
                - then:
                    - loop:
                        in: { values: '$/cardTableList'}
                        each:
                            - copy: { path: '/reportCardList[-1]', value: '$v'} # $V循环非web element元素的数组元素值
        - return: '$/reportCardList'


- segment:
    package: '"school"'
    name: '"getTableData"'
    args: { 0: '/indexType', 1: '/index'}
    body:
        - echo: '"indexType="+$/indexType'
        - echo: '"getTableData="+$/index'
        - if: '$/index == 0'
        - then:
            - loop:
                in: { eles: 'tbody#tbody_0 tr', mutable: '$false'}
                each:
                    - if: '$i>1'
                    - then:
                        - put: { path: '/tableRow/indexType', value: '$/indexType'}
                        - select: { path: '/tableRow/indexDesc', ele: 'td:nth-child(1)', under: '$e'}
                        - select: { path: '/tableRow/school', ele: 'td:nth-child(2)', under: '$e'}
                        - select: { path: '/tableRow/englishStudents', ele: 'td:nth-child(3)', under: '$e'}
                        - select: { path: '/tableRow/frenchStudents', ele: 'td:nth-child(4)', under: '$e'}
                        - put: { path: '/tableList[-1]', value: '$/tableRow'}
                        - remove: { path: '/tableRow'}
            - return: '$/tableList'
        - if: '$/index == 1'
        - then:
            - loop:
                in: { eles: 'tbody#tbody_1 tr', mutable: '$false'}
                each:
                    - if: '$i>1'
                    - then:
                        - put: { path: '/tableRow/indexType', value: '$/indexType'}
                        - select: { path: '/tableRow/indexDesc', ele: 'td:nth-child(1)', under: '$e'}
                        - select: { path: '/tableRow/school', ele: 'td:nth-child(2)', under: '$e'}
                        - select: { path: '/tableRow/englishStudents', ele: 'td:nth-child(3)', under: '$e'}
                        - select: { path: '/tableRow/frenchStudents', ele: 'td:nth-child(4)', under: '$e'}
                        - put: { path: '/tableList[-1]', value: '$/tableRow'}
                        - remove: { path: '/tableRow'}
            - return: '$/tableList'
        - if: '$/index == 2'
        - then:
            - loop:
                in: { eles: 'tbody#tbody_2 tr', mutable: '$false'}
                each:
                    - if: '$i>1'
                    - then:
                        - put: { path: '/tableRow/indexType', value: '$/indexType'}
                        - select: { path: '/tableRow/indexDesc', ele: 'td:nth-child(1)', under: '$e'}
                        - select: { path: '/tableRow/school', ele: 'td:nth-child(2)', under: '$e'}
                        - select: { path: '/tableRow/englishStudents', ele: 'td:nth-child(3)', under: '$e'}
                        - select: { path: '/tableRow/frenchStudents', ele: 'td:nth-child(3)', under: '$e'}
                        - put: { path: '/tableList[-1]', value: '$/tableRow'}
                        - remove: { path: '/tableRow'}
            - return: '$/tableList'            

使用mobile标签出现不停重启浏览器

使用mobile标签后运行脚本出现不停重启浏览器,如

- segment: 
        name: '"taobaoItemDescTest"'
        package: '"aichuanda"'
        args: { 0: '/test' }
        body: 
            - mobile:
            - load: '"https://main.m.taobao.com/index.html"'

Add your first token - distinct

Please try to add the token distinct, function description is below.

SYNOPSIS
- distinct: { path: , value: }

DESCRIPTION
Distinct the values in the list

ATTRIBUTES

name type required
path path
value expr

EXAMPLE
Input: [1,1,2,2,3]
Output: [1,2,3]

- distinct: { path: '/distinct', value: '$/someInputArr' }

提交代码后脚本格式化问题

  • segment下级name、body等标签有8个空格缩进,以及- loop:下级in、each等标签也有8个空格缩进

如下代码

- segment: 
        package: '"taobaoSearch"'
        name: '"pailitaoSearchPic"'
        args: { 0: '/url',1: '/isTest',2: '/img',3: '/buyingId' }
        body: 
            - loop: 
                    in: { eles: '#imgsearch-itemlist .item',mutable: '$false' }
                    each: 

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.