platonai / pulsarrpa
Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.
License: GNU Affero General Public License v3.0
"C:\Program Files\Java\jdk-20\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2023.1.3\lib\idea_rt.jar=3907:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2023.1.3\bin" -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath C:\Users\Administrator\IdeaProjects\PulsarContexts\target\classes;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-jdk8\1.8.21\kotlin-stdlib-jdk8-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib\1.8.21\kotlin-stdlib-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-common\1.8.21\kotlin-stdlib-common-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\annotations\13.0\annotations-13.0.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-jdk7\1.8.21\kotlin-stdlib-jdk7-1.8.21.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-skeleton\1.10.12\pulsar-skeleton-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-common\1.10.12\pulsar-common-1.10.12.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-core\5.3.17\spring-core-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-jcl\5.3.17\spring-jcl-5.3.17.jar;C:\Users\Administrator.m2\repository\xml-apis\xml-apis\1.3.04\xml-apis-1.3.04.jar;C:\Users\Administrator.m2\repository\org\apache\httpcomponents\httpclient\4.5.13\httpclient-4.5.13.jar;C:\Users\Administrator.m2\repository\org\apache\httpcomponents\httpcore\4.4.13\httpcore-4.4.13.jar;C:\Users\Administrator.m2\repository\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\Administrator.m2\repository\commons-codec\commons-codec\1.11\commons-codec-1.11.jar;C:\Users\Administrator.m2\repository\com\ibm\icu\icu4j\4.0.1\icu4j-4.0.1.jar;C:\Users\Administrator.m2\repository\commons-io\commons-io\2.11.0\commons-io-2.11.0.jar;C:\Users\Administrator.m2\repository\org\apac
he\commons\commons-lang3\3.12.0\commons-lang3-3.12.0.jar;C:\Users\Administrator.m2\repository\org\apache\commons\commons-math3\3.3\commons-math3-3.3.jar;C:\Users\Administrator.m2\repository\org\codehaus\woodstox\stax2-api\4.2.1\stax2-api-4.2.1.jar;C:\Users\Administrator.m2\repository\com\fasterxml\woodstox\woodstox-core\6.4.0\woodstox-core-6.4.0.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\module\jackson-module-kotlin\2.13.4\jackson-module-kotlin-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-databind\2.13.4\jackson-databind-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-annotations\2.13.4\jackson-annotations-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\dataformat\jackson-dataformat-properties\2.13.4\jackson-dataformat-properties-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-core\2.13.4\jackson-core-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\datatype\jackson-datatype-jsr310\2.13.4\jackson-datatype-jsr310-2.13.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-serialization\1.5.32\kotlin-serialization-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-gradle-plugin-api\1.5.32\kotlin-gradle-plugin-api-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-native-utils\1.5.32\kotlin-native-utils-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-util-io\1.5.32\kotlin-util-io-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-project-model\1.5.32\kotlin-project-model-1.5.32.jar;C:\Users\Administrator.m2\repository\org\nibor\autolink\autolink\0.10.0\autolink-0.10.0.jar;C:\Users\Administrator.m2\repository\ch\qos\logback\logback-classic\1.2.9\logback-classic-1.2.9.jar;C:\Users\Administrator.m2\repository\ch\qos\logback\logback-core\1.2.9\logback-core-1.2.9.jar;C:\Users\Administrator
.m2\repository\ai\platon\pulsar\pulsar-persist\1.10.12\pulsar-persist-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\gora-shaded-mongodb\0.8\gora-shaded-mongodb-0.8.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-jsoup\1.14.3\pulsar-jsoup-1.14.3.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro\1.8.1\avro-1.8.1.jar;C:\Users\Administrator.m2\repository\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\Administrator.m2\repository\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\Administrator.m2\repository\com\thoughtworks\paranamer\paranamer\2.7\paranamer-2.7.jar;C:\Users\Administrator.m2\repository\org\xerial\snappy\snappy-java\1.1.1.3\snappy-java-1.1.1.3.jar;C:\Users\Administrator.m2\repository\org\tukaani\xz\1.5\xz-1.5.jar;C:\Users\Administrator.m2\repository\org\apache\gora\gora-core\0.8\gora-core-0.8.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-frontend-jaxrs\2.5.2\cxf-rt-frontend-jaxrs-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-common-utilities\2.5.2\cxf-common-utilities-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\ws\xmlschema\xmlschema-core\2.0.1\xmlschema-core-2.0.1.jar;C:\Users\Administrator.m2\repository\org\codehaus\woodstox\woodstox-core-asl\4.1.1\woodstox-core-asl-4.1.1.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-api\2.5.2\cxf-api-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\neethi\neethi\3.0.1\neethi-3.0.1.jar;C:\Users\Administrator.m2\repository\wsdl4j\wsdl4j\1.6.2\wsdl4j-1.6.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-core\2.5.2\cxf-rt-core-2.5.2.jar;C:\Users\Administrator.m2\repository\com\sun\xml\bind\jaxb-impl\2.1.13\jaxb-impl-2.1.13.jar;C:\Users\Administrator.m2\repository\org\apache\geronimo\specs\geronimo-javamail_1.4_spec\1.7.1\geronimo-javamail_1.4_spec-1.7.1.jar;C:\Users\Administrator.m2\repository\javax\ws\rs\jsr311-api\1.1.
1\jsr311-api-1.1.1.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-bindings-xml\2.5.2\cxf-rt-bindings-xml-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-transports-http\2.5.2\cxf-rt-transports-http-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-transports-common\2.5.2\cxf-rt-transports-common-2.5.2.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-web\3.0.6.RELEASE\spring-web-3.0.6.RELEASE.jar;C:\Users\Administrator.m2\repository\aopalliance\aopalliance\1.0\aopalliance-1.0.jar;C:\Users\Administrator.m2\repository\org\codehaus\jettison\jettison\1.3.1\jettison-1.3.1.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-mapred\1.8.1\avro-mapred-1.8.1.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-ipc\1.8.1\avro-ipc-1.8.1.jar;C:\Users\Administrator.m2\repository\org\mortbay\jetty\jetty\6.1.26\jetty-6.1.26.jar;C:\Users\Administrator.m2\repository\org\mortbay\jetty\jetty-util\6.1.26\jetty-util-6.1.26.jar;C:\Users\Administrator.m2\repository\io\netty\netty\3.5.13.Final\netty-3.5.13.Final.jar;C:\Users\Administrator.m2\repository\commons-lang\commons-lang\2.6\commons-lang-2.6.jar;C:\Users\Administrator.m2\repository\org\apache\gora\gora-compiler\0.8\gora-compiler-0.8.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-compiler\1.8.1\avro-compiler-1.8.1.jar;C:\Users\Administrator.m2\repository\org\apache\velocity\velocity\1.7\velocity-1.7.jar;C:\Users\Administrator.m2\repository\joda-time\joda-time\2.7\joda-time-2.7.jar;C:\Users\Administrator.m2\repository\org\jgrapht\jgrapht-core\1.0.0\jgrapht-core-1.0.0.jar;C:\Users\Administrator.m2\repository\org\jgrapht\jgrapht-ext\1.0.0\jgrapht-ext-1.0.0.jar;C:\Users\Administrator.m2\repository\org\tinyjee\jgraphx\jgraphx\2.0.0.1\jgraphx-2.0.0.1.jar;C:\Users\Administrator.m2\repository\jgraph\jgraph\5.13.0.0\jgraph-5.13.0.0.jar;C:\Users\Administrator.m2\repository\org\antlr\antlr4-runtime\4.5.3\antlr4-runtime-4.5.3.jar;C:\Users\A
dministrator.m2\repository\org\springframework\spring-context\5.3.17\spring-context-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-aop\5.3.17\spring-aop-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-beans\5.3.17\spring-beans-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-expression\5.3.17\spring-expression-5.3.17.jar;C:\Users\Administrator.m2\repository\javax\xml\bind\jaxb-api\2.3.1\jaxb-api-2.3.1.jar;C:\Users\Administrator.m2\repository\javax\activation\javax.activation-api\1.2.0\javax.activation-api-1.2.0.jar;C:\Users\Administrator.m2\repository\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\Administrator.m2\repository\org\apache\hadoop\hadoop-common\2.7.2\hadoop-common-2.7.2.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-dom\1.10.12\pulsar-dom-1.10.12.jar;C:\Users\Administrator.m2\repository\com\udojava\EvalEx\2.0\EvalEx-2.0.jar;C:\Users\Administrator.m2\repository\org\perf4j\perf4j\0.9.16\perf4j-0.9.16.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-browser\1.10.12\pulsar-browser-1.10.12.jar;C:\Users\Administrator.m2\repository\io\dropwizard\metrics\metrics-core\4.1.29\metrics-core-4.1.29.jar;C:\Users\Administrator.m2\repository\javax\websocket\javax.websocket-api\1.1\javax.websocket-api-1.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-container-grizzly-client\1.13.1\tyrus-container-grizzly-client-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-framework\2.3.25\grizzly-framework-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-http-server\2.3.25\grizzly-http-server-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-http\2.3.25\grizzly-http-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-client\1.13.1\tyrus-client-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\t
yrus\tyrus-core\1.13.1\tyrus-core-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-spi\1.13.1\tyrus-spi-1.13.1.jar;C:\Users\Administrator.m2\repository\com\github\kklisura\cdt\cdt-java-client\4.0.0\cdt-java-client-4.0.0.jar;C:\Users\Administrator.m2\repository\org\javassist\javassist\3.24.1-GA\javassist-3.24.1-GA.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-ql-common\1.10.12\pulsar-ql-common-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-h2\1.4.196\pulsar-h2-1.4.196.jar;C:\Users\Administrator.m2\repository\org\apache\commons\commons-collections4\4.4\commons-collections4-4.4.jar;C:\Users\Administrator.m2\repository\com\google\code\crawler-commons\crawler-commons\0.5\crawler-commons-0.5.jar;C:\Users\Administrator.m2\repository\org\apache\tika\tika-core\1.6\tika-core-1.6.jar;C:\Users\Administrator.m2\repository\org\slf4j\slf4j-api\1.7.7\slf4j-api-1.7.7.jar;C:\Users\Administrator.m2\repository\com\google\guava\guava\30.1-jre\guava-30.1-jre.jar;C:\Users\Administrator.m2\repository\com\google\guava\failureaccess\1.0.1\failureaccess-1.0.1.jar;C:\Users\Administrator.m2\repository\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\Users\Administrator.m2\repository\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;C:\Users\Administrator.m2\repository\org\checkerframework\checker-qual\3.5.0\checker-qual-3.5.0.jar;C:\Users\Administrator.m2\repository\com\google\errorprone\error_prone_annotations\2.3.4\error_prone_annotations-2.3.4.jar;C:\Users\Administrator.m2\repository\com\google\j2objc\j2objc-annotations\1.3\j2objc-annotations-1.3.jar;C:\Users\Administrator.m2\repository\com\google\code\gson\gson\2.10.1\gson-2.10.1.jar;C:\Users\Administrator.m2\repository\oro\oro\2.0.8\oro-2.0.8.jar;C:\Users\Administrator.m2\repository\com\beust\jcommander\1.81\jcommander-1.81.jar;C:\Users\Administrator.m2\repository\com\github
\oshi\oshi-core\5.6.1\oshi-core-5.6.1.jar;C:\Users\Administrator.m2\repository\net\java\dev\jna\jna\5.8.0\jna-5.8.0.jar;C:\Users\Administrator.m2\repository\net\java\dev\jna\jna-platform\5.8.0\jna-platform-5.8.0.jar;C:\Users\Administrator.m2\repository\io\dropwizard\metrics\metrics-graphite\4.1.29\metrics-graphite-4.1.29.jar;C:\Users\Administrator.m2\repository\com\rabbitmq\amqp-client\5.14.0\amqp-client-5.14.0.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-jdk8\1.6.4\kotlinx-coroutines-jdk8-1.6.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-core-jvm\1.6.4\kotlinx-coroutines-core-jvm-1.6.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-reflect\1.5.32\kotlin-reflect-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-core\1.6.4\kotlinx-coroutines-core-1.6.4.jar ai.platon.pulsar.examples.sites.topEc.english.amazon.MainKt
16:14:22.745 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Find legacy resource: jar:file:/C:/Users/Administrator/.m2/repository/ai/platon/pulsar/pulsar-skeleton/1.10.12/pulsar-skeleton-1.10.12.jar!/config/legacy/pulsar-default.xml
16:14:22.748 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Find legacy resource: jar:file:/C:/Users/Administrator/.m2/repository/ai/platon/pulsar/pulsar-skeleton/1.10.12/pulsar-skeleton-1.10.12.jar!/config/legacy/pulsar-site.xml
16:14:22.749 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Resource not find: pulsar-task.xml
16:14:22.774 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - profile: <> | [pulsar-default.xml, pulsar-site.xml]
16:14:22.792 [main] INFO ai.platon.pulsar.crawl.protocol.ProtocolFactory - Supported protocols:
16:14:22.812 [main] INFO ai.platon.pulsar.crawl.parse.html.PrimerHtmlParser - className: PrimerHtmlParser defaultCharEncoding: utf-8
16:14:22.879 [main] INFO ai.platon.pulsar.crawl.parse.PageParser - maxParseTime: PT1M maxParsedLinks: 200 groupMode: BY_HOST ignoreExternalLinks: false maxUrlLength: 1024
16:14:22.904 [main] INFO ai.platon.pulsar.crawl.impl.StreamingCrawlLoop - Crawl loop is created | @977552154
16:14:22.906 [main] DEBUG org.springframework.context.support.StaticApplicationContext - Refreshing org.springframework.context.support.StaticApplicationContext@58651fd0
16:14:22.953 [main] INFO ai.platon.pulsar.context.PulsarContexts - Active context | ai.platon.pulsar.context.support.StaticPulsarContext#1
16:14:23.985 [main] INFO ai.platon.pulsar.persist.gora.GoraStorage - Backend data store: FileBackendPageStore realSchema: FileBackendPageStore
16:14:24.112 [main] INFO ai.platon.pulsar.persist.AutoDetectStorageProvider - Storage is created: class ai.platon.pulsar.persist.gora.FileBackendPageStore realSchema: FileBackendPageStore
16:14:24.188 [main] INFO ai.platon.pulsar.crawl.component.LoadComponent.Task - 3. 💔 💿 U got 1600 0 <- 0 in , fc:1 ProtoNotFound(1600) | https://www.amazon.com/Best-Sellers/zgbs -outLinkSelector a[href~=/dp/]
16:14:24.188 [main] INFO ai.platon.pulsar.crawl.component.LoadComponent.Task - Log explanation: https://github.com/platonai/pulsarr/blob/master/docs/log-format.adoc
16:14:24.307 [main] INFO ai.platon.pulsar.crawl.impl.StreamingCrawlLoop - Registered 15 link collectors | loop#1 @977552154
[]
16:14:24.330 [SpringContextShutdownHook] DEBUG org.springframework.context.support.StaticApplicationContext - Closing org.springframework.context.support.StaticApplicationContext@58651fd0, started on Sun Jun 25 16:14:22 CST 2023
16:14:24.330 [Thread-0] INFO ai.platon.pulsar.context.support.AbstractPulsarContext - Closing context #1/2 | StaticPulsarContext
16:14:24.331 [Thread-0] INFO ai.platon.pulsar.session.AbstractPulsarSession - Session is closed | #1000002
16:14:24.331 [Thread-0] INFO ai.platon.pulsar.session.AbstractPulsarSession - Session is closed | #1000001
16:14:24.331 [DefaultDispatcher-worker-1] INFO ai.platon.pulsar.crawl.impl.StreamingCrawler - Starting StreamingCrawler #1 ...
Process finished with exit code 0
The log reports that the protocol was not found (ProtoNotFound). What is the likely cause?
Question:
Thanks.
"C:\Program Files\Google\Chrome\Application\chrome.exe" --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true
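The examples include a TaobaoLoginHandler that logs in automatically, but with this approach the site's anti-bot detection often catches it and demands an extra CAPTCHA. Is it possible to
skip the automatic login, let a person log in manually, and have the system start scraping afterwards? How can the manual login state be monitored? Thanks.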
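One way to approach the "wait for a manual login" part is to poll a login check until it succeeds or a timeout expires, and only then start scraping. The sketch below is a minimal illustration, not PulsarRPA's API: `ManualLoginWait`, `waitUntil` and the fake login check are hypothetical names, and the real check (for example, whether a session cookie or an account menu is present) has to be supplied by the caller.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class ManualLoginWait {
    /**
     * Polls the given check until it returns true or the timeout elapses.
     * Returns true if the check succeeded, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier loggedIn, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (loggedIn.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical check: stands in for "the person has finished logging in".
        long start = System.currentTimeMillis();
        BooleanSupplier fakeLoginCheck = () -> System.currentTimeMillis() - start > 200;

        boolean ok = waitUntil(fakeLoginCheck, Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println(ok ? "logged in, start scraping" : "timed out waiting for login");
    }
}
```

The timeout keeps the crawler from waiting forever if nobody logs in, and the polling interval bounds how often the page is inspected.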
2022-10-15 14:44:12.146 WARN [-worker-12] a.p.p.p.b.e.i.BrowserEmulatorImplBase - java.nio.file.FileSystemException: C:\Users\VINCEN~1\AppData\Local\Temp\ln\5a6caaaaa8aaf6e230182a2bbad7c43c.htm: A required privilege is not held by the client.
Environment:
OS: Windows 11
JDK: Java 11
Command line: "C:\Program Files\Java\jdk-11.0.2\bin\java.exe" "-javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2022.1.3\lib\idea_rt.jar=61275:D:\Program Files\JetBrains\IntelliJ IDEA 2022.1.3\bin" -Dfile.encoding=UTF-8 -classpath "..." ai.platon.exotic.examples.sites.walmart.WalmartCrawlerKt
21:19:13.336 [main] INFO a.p.p.b.driver.chrome.ChromeLauncher - User data dir does not exist, copy from prototype | /tmp/pulsar-vincent/context/cx.1iAggh21/pulsar_chrome <- /home/vincent/.pulsar/browser/chrome/prototype/google-chrome
21:19:18.060 [main] WARN a.p.p.b.driver.chrome.ChromeLauncher - Failed to prepare user data dir
java.lang.IllegalArgumentException: Parameter 'srcFile' is not a file: /home/vincent/.pulsar/browser/chrome/prototype/google-chrome/SingletonSocket
at org.apache.commons.io.FileUtils.requireFile(FileUtils.java:2737)
at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:841)
at org.apache.commons.io.FileUtils.doCopyDirectory(FileUtils.java:1312)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:699)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:630)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:531)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:502)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.prepareUserDataDir(ChromeLauncher.kt:236)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.launch(ChromeLauncher.kt:50)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.launch(ChromeLauncher.kt:61)
at ai.platon.pulsar.protocol.browser.driver.BrowserFactory.launchChromeDevtoolsBrowser(BrowserFactory.kt:40)
at ai.platon.pulsar.protocol.browser.driver.BrowserFactory.launch(BrowserFactory.kt:22)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launchIfAbsent$lambda-12(BrowserManager.kt:106)
at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1705)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launchIfAbsent(BrowserManager.kt:105)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launch(BrowserManager.kt:38)
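The trace above shows the directory copy failing on an entry (SingletonSocket) that is not a regular file. A possible way around this is to copy only directories and regular files and skip everything else. The sketch below uses only java.nio and is an illustration under that assumption, not the actual ChromeLauncher code; `SafeDirCopy`, `copyRegularFiles` and `demo` are hypothetical names, and the symbolic link created in `demo` is merely a stand-in for the special file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

public class SafeDirCopy {
    /** Copies directories and regular files from src to dst; sockets, pipes and symlinks are skipped. */
    public static int copyRegularFiles(Path src, Path dst) {
        int[] copied = {0};
        try {
            Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                    Files.createDirectories(dst.resolve(src.relativize(dir)));
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    if (attrs.isRegularFile()) {
                        Files.copy(file, dst.resolve(src.relativize(file)), StandardCopyOption.REPLACE_EXISTING);
                        copied[0]++;
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copied[0];
    }

    /** Builds a small throwaway directory, copies it, and reports whether the copy looks right. */
    public static boolean demo() {
        try {
            Path src = Files.createTempDirectory("prototype");
            Path dst = Files.createTempDirectory("context").resolve("pulsar_chrome");
            Files.createDirectories(src.resolve("Default"));
            Files.write(src.resolve("Default").resolve("Preferences"), "{}".getBytes());
            try {
                // Stand-in for the special entry; creating links may be unsupported on some platforms.
                Files.createSymbolicLink(src.resolve("SingletonSocket"), src.resolve("no-such-target"));
            } catch (IOException | UnsupportedOperationException | SecurityException ignored) {
                // Without the link the copy is still exercised on the regular file.
            }
            int copied = copyRegularFiles(src, dst);
            return copied == 1
                    && Files.isRegularFile(dst.resolve("Default").resolve("Preferences"))
                    && !Files.exists(dst.resolve("SingletonSocket"), LinkOption.NOFOLLOW_LINKS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("copy ok: " + demo());
    }
}
```

Because the file tree is walked without following links, a dangling link is reported as a non-regular entry and is simply left out of the copy.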
2022-05-22 16:49:04.067 WARN [r-worker-5] a.p.p.p.b.e.BrowserEmulator - Unexpected exception
java.lang.StringIndexOutOfBoundsException: String index out of range: 15
at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:48)
at java.base/java.lang.String.charAt(String.java:711)
at ai.platon.pulsar.common.HtmlUtils.isBlankBody(Htmls.kt:107)
at ai.platon.pulsar.protocol.browser.emulator.EmulateEventHandler.checkHtmlIntegrity(EmulateEventHandler.kt:137)
at ai.platon.pulsar.protocol.browser.emulator.EmulateEventHandler.onAfterNavigate(EmulateEventHandler.kt:89)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator.browseWithMinorExceptionsHandled(BrowserEmulator.kt:180)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator.access$browseWithMinorExceptionsHandled(BrowserEmulator.kt:34)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator$browseWithMinorExceptionsHandled$1.invokeSuspend(BrowserEmulator.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
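The exception above comes from a `charAt` call inside `isBlankBody`, which suggests the check indexes past the end of a short or truncated document. The real implementation is not shown here, so the following is only a sketch of how such a check could be written so that every index is validated first; `BlankBodyCheck` is a hypothetical class, and treating a missing or truncated body as blank is an assumption.

```java
import java.util.Locale;

public class BlankBodyCheck {
    /**
     * Returns true if the document has no body, a truncated body tag,
     * or a body that contains only whitespace.
     */
    public static boolean isBlankBody(String html) {
        if (html == null) {
            return true;
        }
        // Work on one lower-cased copy so that every index refers to the same string.
        String s = html.toLowerCase(Locale.ROOT);

        int open = s.indexOf("<body");
        if (open < 0) {
            return true; // no body tag at all
        }
        int tagEnd = s.indexOf('>', open);
        if (tagEnd < 0) {
            return true; // "<body" without a closing '>', e.g. a truncated response
        }
        int contentStart = tagEnd + 1;
        int close = s.indexOf("</body", contentStart);
        int contentEnd = close < 0 ? s.length() : close;

        // The loop bound is derived from the string itself, so charAt cannot go out of range.
        for (int i = contentStart; i < contentEnd; i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlankBody("<html><body></body></html>"));
        System.out.println(isBlankBody("<html><body><p>hi</p></body></html>"));
        System.out.println(isBlankBody("<html><body"));
    }
}
```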
Doesn't work with Chrome v111.
First reported at platonai/exotic-amazon#16.
Our users reported the incompatibility. We diagnosed the problem and found that Chrome 111's browser_protocol.json and js_protocol.json have changed substantially.
Solution:
When running exotic-standalone on WSL, we see:
2022-05-20 15:51:31.107 INFO [r-worker-4] a.p.p.p.b.d.c.ChromeDevtoolsDriver - TypeError: document.body.HMNvqKforEach is not a function
at Function.HMNvqKutils__.updatePulsarStat (:300:23)
at Function.HMNvqKutils__.isActuallyReady (:237:19)
at Function.HMNvqKutils__.checkPulsarStatus (:178:31)
at Function.HMNvqKutils__.waitForReady (:151:26)
at :1:15
We suspect this is caused by the JS resources being loaded twice, so that scriptNameCipher may be calculated twice for some reason:
2022-05-20 15:48:05.973 INFO [r-worker-1] a.p.p.c.ResourceLoader - Find resource js/pulsar_utils.js | jar:file:/home/vincent/workspace/exotic-standalone.jar!/BOOT-INF/lib/pulsar-browser-1.9.6.jar!/js/pulsar_utils.js
2022-05-20 15:48:05.973 INFO [r-worker-2] a.p.p.c.ResourceLoader - Find resource js/pulsar_utils.js | jar:file:/home/vincent/workspace/exotic-standalone.jar!/BOOT-INF/lib/pulsar-browser-1.9.6.jar!/js/pulsar_utils.js
I suggest making sure that the JS resources are loaded only once and that scriptNameCipher is unique within the process.
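The suggestion can be sketched with two standard JVM idioms: a holder class, whose static initializer runs once per class loader, for the cipher, and `ConcurrentHashMap.computeIfAbsent` for the resource cache, so that concurrent workers asking for the same resource trigger a single load. This is an illustration only; `ScriptResources`, the `__NAME_PREFIX__` placeholder and the `reader` parameter are hypothetical and do not reflect the actual PulsarRPA code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ScriptResources {
    /** Initialization-on-demand holder: the JVM runs this initializer once per class loader. */
    private static final class CipherHolder {
        static final String CIPHER = randomLetters(6);
    }

    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();
    private static final AtomicInteger LOADS = new AtomicInteger();

    /** The same value for every caller and every thread in this process. */
    public static String scriptNameCipher() {
        return CipherHolder.CIPHER;
    }

    /** Reads and rewrites a resource at most once per name, even under concurrent calls. */
    public static String load(String name, Function<String, String> reader) {
        return CACHE.computeIfAbsent(name, n -> {
            LOADS.incrementAndGet();
            // Hypothetical placeholder that gets replaced by the process-wide cipher.
            return reader.apply(n).replace("__NAME_PREFIX__", scriptNameCipher());
        });
    }

    public static int loadCount() {
        return LOADS.get();
    }

    private static String randomLetters(int length) {
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(alphabet.charAt(ThreadLocalRandom.current().nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Function<String, String> reader = n -> "__NAME_PREFIX__utils";
        String first = load("js/pulsar_utils.js", reader);
        String second = load("js/pulsar_utils.js", reader);
        System.out.println(first.equals(second) + ", loads = " + loadCount());
    }
}
```

With this shape, two workers that request `js/pulsar_utils.js` at the same moment both receive the script rewritten with the same cipher.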
Reading from FileBackendPageStore failed.
Exception in thread "main" java.nio.file.FileSystemException: C:\Users\Vincent Zhang.pulsar\data\store\nbzfcg-cn\nbzfcg-cn-5fb8f1e5b8322a31bb42dbdcee9d256f.avro: The process cannot access the file because it is being used by another process.
at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
at java.base/sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:274)
at java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110)
at java.base/java.nio.file.Files.deleteIfExists(Files.java:1185)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.readAvro(FileBackendPageStore.kt:98)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:41)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:30)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
at ai.platon.pulsar.persist.WebDb.getOrNull(WebDb.kt:71)
at ai.platon.pulsar.persist.WebDb.getOrNull$default(WebDb.kt:65)
at ai.platon.pulsar.crawl.component.LoadComponent.createPageShell(LoadComponent.kt:259)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred0(LoadComponent.kt:206)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetryDeferred(LoadComponent.kt:116)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred(LoadComponent.kt:95)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred$suspendImpl(AbstractPulsarContext.kt:329)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred(AbstractPulsarContext.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCacheDeferred(AbstractPulsarSession.kt:487)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:192)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.scent.dm.HarvestRunner.loadDeferred(HarvestRunner.kt:251)
at ai.platon.scent.dm.HarvestRunner.access$loadDeferred(HarvestRunner.kt:40)
at ai.platon.scent.dm.HarvestRunner$loadDocumentsDeferred$2$1$1.invokeSuspend(HarvestRunner.kt:275)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
PS C:\Users\Vincent Zhang.pulsar\proxy> java -version
openjdk version "14.0.2" 2020-07-14
OpenJDK Runtime Environment (build 14.0.2+12-46)
OpenJDK 64-Bit Server VM (build 14.0.2+12-46, mixed mode, sharing)
PS C:\Users\Vincent Zhang.pulsar\proxy> Get-ComputerInfo -Property "os*" | select OSName, OsArchitecture
OsName OsArchitecture
Microsoft Windows 11 Home China 64-bit
bin/build-run.sh
// OK
2023-02-16 22:12:09.736 INFO [main] a.p.p.a.m.PulsarMasterKt - Starting PulsarMasterKt v1.10.10-SNAPSHOT using Java 17.0.5 on regulus with PID 21576 (/home/vincent/workspace/pulsar-1.10.x/pulsar-app/pulsar-master/target/pulsar-master-1.10.10-SNAPSHOT.jar started by vincent in /home/vincent/workspace/pulsar-1.10.x)
Then we issue an X-SQL query to scrape:
bin/scrape.sh
// The server issues the warning message
...
...
...
2023-02-16 22:14:14.929 WARN [r-worker-2] a.p.p.c.i.StreamingCrawler - [Unexpected]
java.lang.reflect.InaccessibleObjectException: Unable to make field private final long java.time.Duration.seconds accessible: module java.base does not "opens java.time" to unnamed module @62bd2070
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at com.google.gson.internal.reflect.UnsafeReflectionAccessor.makeAccessible(UnsafeReflectionAccessor.java:44)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.getBoundFields(ReflectiveTypeAdapterFactory.java:159)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.create(ReflectiveTypeAdapterFactory.java:102)
at com.google.gson.Gson.getAdapter(Gson.java:489)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.createBoundField(ReflectiveTypeAdapterFactory.java:117)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.getBoundFields(ReflectiveTypeAdapterFactory.java:166)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.create(ReflectiveTypeAdapterFactory.java:102)
at com.google.gson.Gson.getAdapter(Gson.java:489)
at com.google.gson.Gson.toJson(Gson.java:727)
at com.google.gson.Gson.toJson(Gson.java:714)
at com.google.gson.Gson.toJson(Gson.java:669)
at com.google.gson.Gson.toJson(Gson.java:649)
at ai.platon.pulsar.browser.common.InteractSettings.overrideConfiguration(BrowserSettings.kt:391)
at ai.platon.pulsar.common.options.LoadOptions.overrideConfiguration(LoadOptions.kt:691)
at ai.platon.pulsar.common.options.LoadOptions.overrideConfiguration(LoadOptions.kt:671)
at ai.platon.pulsar.common.urls.CombinedUrlNormalizer.createLoadOptions(CombinedUrlNormalizer.kt:49)
at ai.platon.pulsar.common.urls.CombinedUrlNormalizer.normalize(CombinedUrlNormalizer.kt:21)
at ai.platon.pulsar.context.support.AbstractPulsarContext.normalize(AbstractPulsarContext.kt:209)
at ai.platon.pulsar.session.AbstractPulsarSession.normalize(AbstractPulsarSession.kt:132)
at ai.platon.pulsar.session.PulsarSession$DefaultImpls.normalize$default(PulsarSession.kt:187)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:185)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.loadWithMinorExceptionHandled(StreamingCrawler.kt:475)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$loadWithMinorExceptionHandled(StreamingCrawler.kt:66)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invokeSuspend(StreamingCrawler.kt:395)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invoke(StreamingCrawler.kt)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invoke(StreamingCrawler.kt)
at kotlinx.coroutines.intrinsics.UndispatchedKt.startUndispatchedOrReturnIgnoreTimeout(Undispatched.kt:100)
at kotlinx.coroutines.TimeoutKt.setupTimeout(Timeout.kt:146)
at kotlinx.coroutines.TimeoutKt.withTimeout(Timeout.kt:44)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.loadWithTimeout(StreamingCrawler.kt:394)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.runLoadTaskWithEventHandlers(StreamingCrawler.kt:376)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$runLoadTaskWithEventHandlers(StreamingCrawler.kt:66)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$runWithStatusCheck$2.invokeSuspend(StreamingCrawler.kt:354)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)
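The trace above shows Gson falling back to reflection on the private fields of `java.time.Duration`, which the module system refuses on recent JDKs. Two common ways around this are to open the package at launch with `--add-opens java.base/java.time=ALL-UNNAMED`, or to serialize `Duration` through its public API instead of its fields. The sketch below illustrates the second idea with a plain string codec; `DurationCodec` is a hypothetical helper, and wiring it into Gson (for example through `GsonBuilder.registerTypeAdapter`) is left out so that the example has no dependencies.

```java
import java.time.Duration;

public class DurationCodec {
    /** Encodes a duration as ISO-8601 text, using only the public API. */
    public static String encode(Duration duration) {
        return duration.toString();
    }

    /** Decodes ISO-8601 text back into a duration. */
    public static Duration decode(String text) {
        return Duration.parse(text);
    }

    public static void main(String[] args) {
        Duration original = Duration.ofSeconds(90);
        String text = encode(original);
        Duration restored = decode(text);
        System.out.println(text + " -> " + restored.getSeconds() + " s, equal = " + original.equals(restored));
    }
}
```

Since no private field is touched, this round trip does not depend on which packages the JVM opens.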
2022-11-21 10:52:06.400 WARN [main] a.p.p.p.b.e.i.BrowserEmulatorImplBase - java.nio.file.FileSystemException: C:\Users\VINCEN~1\AppData\Local\Temp\ln\eff06eec5f83b35c9737f4d3f0f153f7.htm: A required privilege is not held by the client.
15:26:41.436 [-worker-15] WARN a.p.p.p.b.e.i.BrowserEmulatedFetcherImpl - [Unexpected] Failed to visit page | https://www.google.de/search?q=Favorite+World%2C+LLC+Telefon
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringUTF16.compress(StringUTF16.java:168)
at java.base/java.lang.StringUTF16.newString(StringUTF16.java:1019)
at java.base/java.lang.StringBuilder.toString(StringBuilder.java:453)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatorImplBase.createResponse(BrowserEmulatorImplBase.kt:89)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.browseWithWebDriver(InteractiveBrowserEmulator.kt:325)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.access$browseWithWebDriver(InteractiveBrowserEmulator.kt:37)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator$browseWithWebDriver$1.invokeSuspend(InteractiveBrowserEmulator.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
I use ScentSQLContext.executeQuery to run the SQL function load_and_select(@url, 'css') for each URL in a list.
The Java heap keeps growing and never shrinks, and about two hours later an OutOfMemoryError is thrown.
I do call rs.close(), but the heap keeps growing anyway.
The heap dump shows that NodeList holds many byte[] instances, such as:
How can I resolve this issue?
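One way to narrow down a leak like this is to log the used heap after every batch of queries and check whether it climbs steadily. The following is a minimal diagnostic sketch using only the JDK's management API; the class name `HeapProbe` and the label format are illustrative assumptions, not part of PulsarRPA.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports current heap usage so growth can be tracked per batch.
final class HeapProbe {
    // Bytes of heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // One-line summary suitable for a log message; max may be -1 if the JVM does not define it.
    static String describe(String label) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long usedMiB = heap.getUsed() >> 20;
        long maxMiB = heap.getMax() < 0 ? -1 : heap.getMax() >> 20;
        return String.format("%s | heap used: %d MiB, max: %d MiB", label, usedMiB, maxMiB);
    }
}
```

Calling `HeapProbe.describe("after batch N")` once per batch gives a series that can be compared with the heap dump to see which step retains memory.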
The following warnings appear every time we run PulsarRPA examples/demos/services:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by javassist.util.proxy.SecurityActions (file:/C:/Users/pereg/.m2/repository/javassist/javassist/3.12.1.GA/javassist-3.12.1.GA.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of javassist.util.proxy.SecurityActions
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
A typical error message:
This page isn’t working
wrd.walmart.com didn’t send any data.
ERR_EMPTY_RESPONSE
It seems that the proxy IP has been blocked by the website.
The package com.sun.jna comes with kotlin, but there is no close() method in com.sun.jna.Memory.
2023-09-11 21:47:40.489 INFO [main] a.p.p.p.b.e.c.BasicPrivacyContextManager - Privacy context is created #091119IXKO1
java.lang.NoSuchMethodError: com.sun.jna.Memory.close()V
at oshi.util.Util.freeMemory(Util.java:83)
at oshi.jna.ByRef$CloseableHANDLEByReference.close(ByRef.java:95)
at oshi.software.os.windows.WindowsOperatingSystem.enableDebugPrivilege(WindowsOperatingSystem.java:469)
at oshi.software.os.windows.WindowsOperatingSystem.(WindowsOperatingSystem.java:105)
at oshi.SystemInfo.createOperatingSystem(SystemInfo.java:82)
at oshi.util.Memoizer$1.get(Memoizer.java:61)
at oshi.SystemInfo.getOperatingSystem(SystemInfo.java:76)
at ai.platon.pulsar.common.AppSystemInfo$Companion.isOSHIAvailable(AppSystemInfo.kt:132)
at ai.platon.pulsar.common.AppSystemInfo.(AppSystemInfo.kt:30)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.shouldCreateWebDriver(LoadingWebDriverPool.kt:370)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.resourceSafeCreateDriverIfNecessary(LoadingWebDriverPool.kt:334)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.pollWebDriver(LoadingWebDriverPool.kt:314)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:202)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:197)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.pollWithEvents(LoadingWebDriverPool.kt:304)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:224)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.runWithDriverPool(WebDriverPoolManager.kt:493)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.access$runWithDriverPool(WebDriverPoolManager.kt:32)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invokeSuspend(WebDriverPoolManager.kt:461)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.common.PreemptChannelSupport.whenNormalDeferred(PreemptChannelSupport.kt:58)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.runWithDriverPool(WebDriverPoolManager.kt:449)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.doRun(WebDriverPoolManager.kt:398)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:157)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:137)
at ai.platon.pulsar.protocol.browser.emulator.context.WebDriverContext.run(WebDriverContext.kt:77)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun$suspendImpl(BrowserPrivacyContext.kt:69)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun(BrowserPrivacyContext.kt)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run$suspendImpl(PrivacyContext.kt:287)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run(PrivacyContext.kt)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run1(BasicPrivacyContextManager.kt:92)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run0(BasicPrivacyContextManager.kt:80)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run(BasicPrivacyContextManager.kt:34)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchTaskDeferred(BrowserEmulatedFetcherImpl.kt:93)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContentDeferred$suspendImpl(BrowserEmulatedFetcherImpl.kt:80)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContentDeferred(BrowserEmulatedFetcherImpl.kt)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl$fetchContent$1.invokeSuspend(BrowserEmulatedFetcherImpl.kt:57)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContent(BrowserEmulatedFetcherImpl.kt:56)
at ai.platon.pulsar.protocol.browser.BrowserEmulatorProtocol.getResponse(BrowserEmulatorProtocol.kt:43)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputWithRetry(AbstractHttpProtocol.kt:118)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutput(AbstractHttpProtocol.kt:88)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContent0(FetchComponent.kt:108)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContent(FetchComponent.kt:75)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContent(LoadComponent.kt:505)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentIfNecessary(LoadComponent.kt:266)
at ai.platon.pulsar.crawl.component.LoadComponent.load1(LoadComponent.kt:233)
at ai.platon.pulsar.crawl.component.LoadComponent.load0(LoadComponent.kt:227)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetry(LoadComponent.kt:129)
at ai.platon.pulsar.crawl.component.LoadComponent.load(LoadComponent.kt:117)
at ai.platon.pulsar.context.support.AbstractPulsarContext.load(AbstractPulsarContext.kt:367)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCache(AbstractPulsarSession.kt:493)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:184)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:171)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:169)
at ai.platon.pulsar.examples._0_BasicUsageKt.main(_0_BasicUsage.kt:17)
at ai.platon.pulsar.examples._0_BasicUsageKt.main(_0_BasicUsage.kt)
MongoDB is already closed by the time MiscMessageWriter.close runs, and that method calls WebDb.flush. This happens when an embedded MongoDB is started in Exotic.
A possible solution is to remove MiscMessageWriter's dependency on WebDb.
2022-05-29 20:14:20.643 ERROR [utdownHook] a.p.p.p.WebDb - ai.platon.shaded.com.mongodb.MongoSocketReadException: Prematurely reached end of stream
at ai.platon.shaded.com.mongodb.internal.connection.SocketStream.read(SocketStream.java:112)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:579)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:444)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:298)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:258)
at ai.platon.shaded.com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
at ai.platon.shaded.com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:343)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:334)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:220)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper$5.call(CommandOperationHelper.java:206)
at ai.platon.shaded.com.mongodb.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:463)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:203)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:198)
at ai.platon.shaded.com.mongodb.operation.CommandReadOperation.execute(CommandReadOperation.java:59)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:194)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:175)
at ai.platon.shaded.com.mongodb.DB.executeCommand(DB.java:775)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:521)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:537)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:492)
at ai.platon.shaded.com.mongodb.Mongo.fsync(Mongo.java:648)
at org.apache.gora.mongodb.store.MongoStore.flush(MongoStore.java:294)
at ai.platon.pulsar.persist.WebDb.flush(WebDb.kt:261)
at ai.platon.pulsar.common.message.MiscMessageWriter.commit(MiscMessageWriter.kt:305)
at ai.platon.pulsar.common.message.MiscMessageWriter.close(MiscMessageWriter.kt:310)
at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:239)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1161)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1154)
at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1106)
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1075)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:172)
at org.springframework.context.support.AbstractApplicationContext$1.run(AbstractApplicationContext.java:991)
The following queries fail:
document.selectHyperlinks('[href=/dp/]')
ele.selectHyperlinks('[href=/dp/]')
The following queries are supported by Chrome DevTools, but I am not sure whether they are standard; they also fail:
document.selectHyperlinks('[href*=/dp/]')
ele.selectHyperlinks('[href*=/dp/]')
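Under the Selectors specification, an attribute value in a selector must be either a CSS identifier or a quoted string, and `/dp/` is not a valid identifier, so the portable form quotes the value. The sketch below builds such a selector; the class and method names are hypothetical, and whether `selectHyperlinks` accepts the quoted form is an assumption that still needs to be verified.

```java
// Hypothetical helper: builds a substring-match attribute selector with a quoted value.
final class AttributeSelectors {
    // Returns e.g. [href*="/dp/"]; backslashes and double quotes in the value are escaped.
    static String contains(String attribute, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "[" + attribute + "*=\"" + escaped + "\"]";
    }
}
```

For example, `AttributeSelectors.contains("href", "/dp/")` yields `[href*="/dp/"]`, which can then be passed to the query in place of the unquoted form.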
Some websites use class names that do not conform to the standard. For example,
<div class='KAHaP+'></div>
The character "+" is not allowed in a class name, so Jsoup throws a SelectorParseException and pulsar-dom throws a PowerSelectorParseException.
We found the issue when handling jd.com and shopee.sg.
Jsoup follows the CSS2 value definition standard:
https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier
In CSS, identifiers (including element names, classes, and IDs in [selectors](https://www.w3.org/TR/CSS2/selector.html)) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
For more about valid characters in a CSS selector:
https://pineco.de/css-quick-tip-the-valid-characters-in-a-custom-css-selector/
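A valid name will look something like this (note that the hyphen must come last inside the character class, otherwise it is read as a range):
-?[_a-zA-Z]+[_a-zA-Z0-9-]*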
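The CSS2 text quoted above allows otherwise illegal characters in an identifier as long as they are backslash-escaped. The following sketch checks a class name against the simple pattern and escapes characters such as "+". The names are illustrative, the escaping is deliberately simplified (it does not handle a leading digit or control characters), and whether a particular selector engine honors the escape is an assumption to verify.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking and escaping CSS class names.
final class CssIdentifiers {
    // Simple ASCII-only validity pattern, with the hyphen placed last in the class.
    private static final Pattern VALID = Pattern.compile("-?[_a-zA-Z]+[_a-zA-Z0-9-]*");

    static boolean isValid(String name) {
        return VALID.matcher(name).matches();
    }

    // Prefixes every character outside [a-zA-Z0-9_-] and below U+00A0 with a backslash.
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean safe = c == '_' || c == '-' || c >= 0x00A0
                    || (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (!safe) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this, the class `KAHaP+` would be queried as `"." + CssIdentifiers.escape("KAHaP+")`. If the engine rejects escapes, an attribute selector with a quoted value such as `[class~="KAHaP+"]` is a possible alternative.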
Failed to create web driver pulsar_chrome, caused by "Using unsafe HTTP verb GET to invoke /json/new. This action supports only PUT verb."
15:39:53.165 [r-worker-2] INFO a.p.pulsar.common.ProcessLauncher - Launching process:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --headless --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/pulsar-kust/context/browser/br.2jede
15:39:53.487 [r-worker-2] ERROR a.p.p.p.b.driver.WebDriverFactory - Failed to create web driver pulsar_chrome
ai.platon.pulsar.protocol.browser.DriverLaunchException: Failed to create chrome devtools driver
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsDriver.(ChromeDevtoolsDriver.kt:110)
at ai.platon.pulsar.protocol.browser.driver.WebDriverFactory.createChromeDevtoolsDriver(WebDriverFactory.kt:80)
at ai.platon.pulsar.protocol.browser.driver.WebDriverFactory.create(WebDriverFactory.kt:44)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.createDriverIfNecessary(LoadingWebDriverPool.kt:226)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll0(LoadingWebDriverPool.kt:204)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:118)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:113)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.firstLaunch(WebDriverPoolManager.kt:255)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.access$firstLaunch(WebDriverPoolManager.kt:40)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invokeSuspend(WebDriverPoolManager.kt:211)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.common.PreemptChannelSupport.whenNormalDeferred(PreemptChannelSupport.kt:59)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run0(WebDriverPoolManager.kt:194)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:105)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:101)
at ai.platon.pulsar.protocol.browser.emulator.context.WebDriverContext.run(BrowserContexts.kt:60)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun$suspendImpl(BrowserPrivacyContext.kt:43)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun(BrowserPrivacyContext.kt)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run$suspendImpl(PrivacyContext.kt:118)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run(PrivacyContext.kt)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run0(MultiPrivacyContextManager.kt:118)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run(MultiPrivacyContextManager.kt:101)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run(MultiPrivacyContextManager.kt:54)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulatedFetcher.fetchTaskDeferred(BrowserEmulatedFetcher.kt:76)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulatedFetcher.fetchContentDeferred(BrowserEmulatedFetcher.kt:69)
at ai.platon.pulsar.protocol.browser.BrowserEmulatorProtocol.getResponseDeferred(BrowserEmulatorProtocol.kt:49)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputDeferred$suspendImpl(AbstractHttpProtocol.kt:101)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputDeferred(AbstractHttpProtocol.kt)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContentDeferred0(FetchComponent.kt:133)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContentDeferred(FetchComponent.kt:95)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentDeferred(LoadComponent.kt:442)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentIfNecessaryDeferred(LoadComponent.kt:232)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred1(LoadComponent.kt:217)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred0(LoadComponent.kt:211)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetryDeferred(LoadComponent.kt:107)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred(LoadComponent.kt:94)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred$suspendImpl(AbstractPulsarContext.kt:326)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred(AbstractPulsarContext.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCacheDeferred(AbstractPulsarSession.kt:207)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:197)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:190)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.crawl.StreamingCrawler.loadWithEventHandlers(StreamingCrawler.kt:520)
at ai.platon.pulsar.crawl.StreamingCrawler.loadUrl(StreamingCrawler.kt:416)
at ai.platon.pulsar.crawl.StreamingCrawler.runUrlTask(StreamingCrawler.kt:405)
at ai.platon.pulsar.crawl.StreamingCrawler.access$runUrlTask(StreamingCrawler.kt:68)
at ai.platon.pulsar.crawl.StreamingCrawler$runWithStatusCheck$2.invokeSuspend(StreamingCrawler.kt:379)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Caused by: ai.platon.pulsar.browser.driver.chrome.util.WebSocketServiceException: Received error (405) - Method Not Allowed
Using unsafe HTTP verb GET to invoke /json/new. This action supports only PUT verb.
at ai.platon.pulsar.browser.driver.chrome.impl.Chrome.request(Chrome.kt:157)
at ai.platon.pulsar.browser.driver.chrome.impl.Chrome.createTab(Chrome.kt:66)
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsBrowserInstance.createTab(ChromeDevtoolsBrowserInstance.kt:45)
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsDriver.(ChromeDevtoolsDriver.kt:97)
... 54 common frames omitted
15:39:53.489 [r-worker-2] WARN a.p.pulsar.crawl.StreamingCrawler - Failed to create web driver | pulsar_chrome
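The 405 response in the log says that this Chrome build accepts only PUT on the `/json/new` endpoint. A minimal sketch of issuing that request with the JDK's built-in HTTP client follows; the class name, host, and port handling are assumptions for illustration and this is not the code used in `Chrome.kt`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: opens a new tab through the DevTools HTTP endpoint using PUT.
final class DevToolsTabs {
    // Builds a PUT request with an empty body for http://host:port/json/new.
    static HttpRequest newTabRequest(String host, int port) {
        URI uri = URI.create("http://" + host + ":" + port + "/json/new");
        return HttpRequest.newBuilder(uri)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the request and returns the response body (a JSON description of the tab).
    static String createTab(String host, int port) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(newTabRequest(host, port), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

Older Chrome builds accepted GET on the same path, so a client that must support both could retry with GET when PUT is rejected; that fallback is a design suggestion, not something the log confirms.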
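For example, the web version of Xiaohongshu only supports login by SMS verification code or by scanning a QR code. In that case, how can I log in first and then browse?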
The inactive privacy context was not closed properly.
13:00:14.018 [r-worker-1] INFO a.p.p.p.b.e.c.BrowserPrivacyContext - Privacy context #10102H5yL71 has lived for 2h59m33s | success: 1248(0.12 pages/s) | small: 1(0.1%) | traffic: 376.43 MiB(35.70 KiB/s) | tasks: 1290 total run: 1284 | null
It would be nice to have at least a Dockerfile, so that the project can be distributed the way real-world software is.
It would also help for security purposes of any kind.
At least that.
Exception in thread "DefaultDispatcher-worker-1 @sc#1" java.nio.file.FileSystemException: /run/user/1000/doc: Operation not permitted
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
at java.base/sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:115)
at java.base/sun.nio.fs.UnixFileStore.getTotalSpace(UnixFileStore.java:122)
at ai.platon.pulsar.common.metrics.AppMetrics$Companion.getFreeSpace(AppMetrics.kt:107)
`vincent@vincent-KLVC-WXX9:~/workspace/pulsar-1.10.x$ java -version
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
vincent@vincent-KLVC-WXX9:~$ uname -a
Linux vincent-KLVC-WXX9 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
`
File stores that cannot be read because of missing permissions should be handled properly:
FileSystems.getDefault().fileStores
    .filter { ByteUnitConverter.convert(totalSpaceOr0(it), "G") > 20 }
    .map { unallocatedSpaceOr0(it) }
    .filter { it > 0 }
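The helpers `totalSpaceOr0` and `unallocatedSpaceOr0` named in the snippet are not shown in the report. A plausible sketch of them, written in Java with only the JDK, catches the `IOException` (of which `FileSystemException` is a subclass) and falls back to 0. The class name and the byte-based threshold parameter are assumptions.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads file store sizes without failing on unreadable mounts.
final class FileStoreSpace {
    // Total size in bytes, or 0 if the store cannot be queried.
    static long totalSpaceOr0(FileStore store) {
        try {
            return store.getTotalSpace();
        } catch (IOException | SecurityException e) {
            return 0L;
        }
    }

    // Unallocated size in bytes, or 0 if the store cannot be queried.
    static long unallocatedSpaceOr0(FileStore store) {
        try {
            return store.getUnallocatedSpace();
        } catch (IOException | SecurityException e) {
            return 0L;
        }
    }

    // Unallocated space of every store larger than minTotalBytes, skipping zero results.
    static List<Long> unallocatedSpaces(long minTotalBytes) {
        List<Long> result = new ArrayList<>();
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            if (totalSpaceOr0(store) > minTotalBytes) {
                long free = unallocatedSpaceOr0(store);
                if (free > 0) {
                    result.add(free);
                }
            }
        }
        return result;
    }
}
```

A call such as `FileStoreSpace.unallocatedSpaces(20L << 30)` mirrors the 20 GB filter in the snippet while letting a mount like `/run/user/1000/doc` simply count as 0.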
When using the latest Google Chrome:
With the regular (headed) google-chrome browser:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
With the google-chrome-headless browser:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/95.0.4638.69 Safari/537.36
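The two strings differ only in the `HeadlessChrome` product token. If the goal is for the headless browser to report the same user agent as the headed one, a small rewrite like the sketch below would do it. The class name is hypothetical, and how the resulting string is applied to the browser (for example via a launch option) is left open here.

```java
// Hypothetical helper: makes a headless Chrome user agent look like the headed one.
final class UserAgents {
    // Replaces the "HeadlessChrome/" product token with "Chrome/"; other text is untouched.
    static String maskHeadless(String userAgent) {
        return userAgent.replace("HeadlessChrome/", "Chrome/");
    }
}
```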
Too many warning logs after MongoDB crashes:
10:56:03.617 [r-worker-5] WARN a.p.p.c.i.StreamingCrawler - [Unexpected]
ai.platon.shaded.com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ai.platon.shaded.com.mongodb.client.internal.MongoClientDeleg
ate$1@7335a5ec. Client view of cluster state is {type=STANDALONE, servers=[{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING, exception={ai.platon.shaded.com.mongodb.Mongo
SocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
at ai.platon.shaded.com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:408)
at ai.platon.shaded.com.mongodb.internal.connection.BaseCluster.selectServer(BaseCluster.java:123)
at ai.platon.shaded.com.mongodb.internal.connection.AbstractMultiServerCluster.selectServer(AbstractMultiServerCluster.java:54)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:152)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:102)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:282)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:206)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:180)
at ai.platon.shaded.com.mongodb.DBCollection.executeWriteOperation(DBCollection.java:356)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:588)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:507)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:482)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:459)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:527)
at org.apache.gora.mongodb.store.MongoStore.performPut(MongoStore.java:380)
at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:345)
at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:70)
at ai.platon.pulsar.persist.WebDb.putInternal(WebDb.kt:134)
at ai.platon.pulsar.persist.WebDb.put(WebDb.kt:109)
at ai.platon.pulsar.persist.WebDb.put$default(WebDb.kt:109)
at ai.platon.pulsar.crawl.component.LoadComponent.persist(LoadComponent.kt:575)
at ai.platon.pulsar.crawl.component.LoadComponent.onLoaded(LoadComponent.kt:371)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred1(LoadComponent.kt:231)
at ai.platon.pulsar.crawl.component.LoadComponent.access$loadDeferred1(LoadComponent.kt:41)
at ai.platon.pulsar.crawl.component.LoadComponent$loadDeferred1$1.invokeSuspend(LoadComponent.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
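One common way to keep a repeated failure such as this timeout from flooding the log is to rate-limit the warning and report how many occurrences were suppressed. The sketch below shows the idea with a caller-supplied timestamp so it is easy to test; the class name and API are assumptions, not an existing PulsarRPA facility.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: allows at most one log message per interval and counts the rest.
final class LogThrottle {
    private final long intervalMillis;
    private final AtomicLong nextAllowedAt = new AtomicLong(Long.MIN_VALUE);
    private final AtomicLong suppressed = new AtomicLong();

    LogThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns true if the caller may log now; otherwise records one suppressed message.
    boolean tryAcquire(long nowMillis) {
        long next = nextAllowedAt.get();
        if (nowMillis >= next && nextAllowedAt.compareAndSet(next, nowMillis + intervalMillis)) {
            return true;
        }
        suppressed.incrementAndGet();
        return false;
    }

    // Returns the number of messages suppressed since the last call and resets the counter.
    long drainSuppressed() {
        return suppressed.getAndSet(0);
    }
}
```

A caller would wrap the warning in `if (throttle.tryAcquire(System.currentTimeMillis()))` and append `drainSuppressed()` to the message, so the log still shows that the failure is ongoing without printing the full stack trace every time.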
Original report:
The proxy is expired (xxx), context reset will be triggered
platonai/exotic-amazon#19
This bug has already been fixed in 1.10.x.
Too many RobustRPC logs, for example:
2023-09-12 14:07:43.898 INFO [-worker-37] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
2023-09-12 14:07:44.072 INFO [-worker-51] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
2023-09-12 14:07:45.095 INFO [-worker-23] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
When I call StaticH2SQLContext().executeQuery(sql) from the IDE, it works correctly.
But when I build the project with mvn package and start it with java -jar, it throws this exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
Caused by: org.h2.jdbc.JdbcSQLException: Function "DOM_LOCATION" not found; SQL statement:
select xxx from xxx
at org.h2.message.DbException.getJdbcSQLException(DbException.java:357)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.command.Parser.readJavaFunction(Parser.java:2699)
at org.h2.command.Parser.readFunction(Parser.java:2756)
at org.h2.command.Parser.readTerm(Parser.java:3102)
at org.h2.command.Parser.readFactor(Parser.java:2587)
at org.h2.command.Parser.readSum(Parser.java:2574)
at org.h2.command.Parser.readConcat(Parser.java:2544)
at org.h2.command.Parser.readCondition(Parser.java:2370)
at org.h2.command.Parser.readAnd(Parser.java:2342)
at org.h2.command.Parser.readExpression(Parser.java:2334)
at org.h2.command.Parser.parseSelectSimpleSelectPart(Parser.java:2245)
at org.h2.command.Parser.parseSelectSimple(Parser.java:2277)
at org.h2.command.Parser.parseSelectSub(Parser.java:2133)
at org.h2.command.Parser.parseSelectUnion(Parser.java:1946)
at org.h2.command.Parser.parseSelect(Parser.java:1919)
at org.h2.command.Parser.parsePrepared(Parser.java:463)
at org.h2.command.Parser.parse(Parser.java:335)
at org.h2.command.Parser.parse(Parser.java:307)
at org.h2.command.Parser.prepareCommand(Parser.java:278)
at org.h2.engine.Session.prepareLocal(Session.java:626)
at org.h2.engine.Session.prepareCommand(Session.java:564)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1247)
at org.h2.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:78)
at ai.platon.pulsar.ql.context.AbstractSQLContext.executeQuery(AbstractSQLContext.kt:89)
... 10 more
When I use ScentSQLContext.create() in the packaged build, it works correctly, but SQLContexts.create() fails.
How can I use SQLContext in a packaged build and have it work?
<mirror>
<id>aliyunmaven</id>
<mirrorOf>central</mirrorOf>
<name>阿里云公共仓库</name>
<url>https://maven.aliyun.com/repository/public</url>
</mirror>
<mirror>
<id>spring</id>
<mirrorOf>central</mirrorOf>
<name>spring公共仓库</name>
<url>https://maven.aliyun.com/repository/spring</url>
</mirror>
<mirror>
<id>repo</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>https://repo.maven.apache.org/maven2/</url>
</mirror>
<mirror>
<id>repo2</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>https://oss.sonatype.org/#stagingRepositories</url>
</mirror>
<mirror>
<id>repo3</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>https://repo1.maven.org/maven2/ai/platon/pulsar</url>
</mirror>
<mirror>
<id>platonic</id>
<mirrorOf>public</mirrorOf>
<name>platonic公共仓库</name>
<url>http://static.platonic.fun/repo/</url>
</mirror>
<mirror>
<id>maven-default-http-blocker</id>
<mirrorOf>dummy</mirrorOf>
<name>Dummy mirror to override default blocking mirror that blocks http</name>
<url>http://0.0.0.0/</url>
</mirror>
18:56:56.985 [main] WARN a.p.pulsar.dom.select.PowerSelector - Failed to parse css query | #productDescription, h2:contains(Product Description) --x-- div | https://www.amazon.com/dp/B07V2CLJLV | Could not parse query '--x--': unexpected token at '--x--'
Loading proxy.providers.txt fails when multiple threads read it in parallel. The file should be locked.
17:02:17.125 [-worker-33] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
17:02:17.125 [-worker-30] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
17:02:17.125 [r-worker-8] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
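One way to serialize the load is sketched below. It is only an illustration, assuming an in-process lock is enough; `GuardedFileLoader` is a hypothetical name and not part of Pulsar's ProxyLoader. If several processes share the file, an OS-level file lock would be needed instead.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical helper (not Pulsar's ProxyLoader): only one thread
// at a time is allowed to read the providers file.
class GuardedFileLoader(private val path: Path) {
    private val lock = ReentrantLock()

    // Returns the lines of the file, or an empty list if the file is missing.
    fun loadLines(): List<String> = lock.withLock {
        if (Files.exists(path)) Files.readAllLines(path) else emptyList()
    }
}
```

Worker threads would share a single `GuardedFileLoader` instance, so concurrent calls to `loadLines()` queue up instead of reading the file at the same time.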
driver.allTexts() is supposed to return a List, but when I call it the result is not a flat list; there seems to be another list nested inside it.
val logisticsInfoList=driver.allTexts(".logistics-info-mod__header___2_fWN")
println("logisticsInfoList="+logisticsInfoList)
The printed result is logisticsInfoList=[["菜鸟直送(丹鸟KD):621089810336681","申通快递:773260085378001"]]
Pulsar has to know where Chrome is installed in order to drive it, so a user-specified path should be supported.
By default, Pulsar searches the following paths for Google Chrome:
val CHROME_BINARY_SEARCH_PATHS = arrayOf(
"/usr/bin/google-chrome-stable",
"/usr/bin/google-chrome",
"/opt/google/chrome/chrome",
"C:/Program Files (x86)/Google/Chrome/Application/chrome.exe",
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
"/Applications/Chromium.app/Contents/MacOS/Chromium",
"/usr/bin/chromium",
"/usr/bin/chromium-browser"
)
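The lookup could be extended to honor a user-specified path along the following lines. This is a minimal sketch, not Pulsar's actual implementation: `findChromeBinary` and its parameters are hypothetical, and the search list is shortened for brevity.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Shortened copy of the default search list shown above.
val DEFAULT_SEARCH_PATHS = arrayOf(
    "/usr/bin/google-chrome-stable",
    "/usr/bin/google-chrome",
    "/usr/bin/chromium"
)

// Hypothetical helper: a user-specified path, if given, is tried first,
// then the default locations. Returns null when nothing is executable.
fun findChromeBinary(
    userSpecified: String? = null,
    searchPaths: Array<String> = DEFAULT_SEARCH_PATHS,
    isExecutable: (Path) -> Boolean = Files::isExecutable
): Path? {
    val candidates = listOfNotNull(userSpecified) + searchPaths
    return candidates.map { Paths.get(it) }.firstOrNull(isExecutable)
}
```

The `isExecutable` parameter exists only so the lookup can be exercised without a real Chrome installation; callers would normally leave it at its default.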
For example, we construct a URL for a page that does not exist: https://www.amazon.com/dp/006323047_404.
PulsarR has to properly handle such pages:
bin/build.sh ->
[INFO] Running ai.platon.pulsar.common.sql.TestSQLTemplate
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 s - in ai.platon.pulsar.common.sql.TestSQLTemplate
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] TestAppRuntimes.testDeleteBrokenSymbolicLinksUsingJava:92->testDeleteBrokenSymbolicLinksUsingJava$lambda-14:92 » FileSystem
[INFO]
[ERROR] Tests run: 149, Failures: 0, Errors: 1, Skipped: 8
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Pulsar 1.8.4-SNAPSHOT:
[INFO]
[INFO] Pulsar ............................................. SUCCESS [ 1.669 s]
[INFO] Pulsar Common ...................................... FAILURE [ 44.509 s]
[INFO] Pulsar Third ....................................... SKIPPED
SLF4J issues warnings:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Connection timed out when closing staging repository:
Uploaded to ossrh: https://oss.sonatype.org:443/service/local/staging/deployByRepositoryId/aiplatonpulsar-1066/ai/platon/pulsar/pulsar-protocol/1.10.16/pulsar-protocol-1.10.16.pom.asc (659 B at 805 B/s)
[INFO] * Upload of locally staged artifacts finished.
[INFO] * Closing staging repository with ID "aiplatonpulsar-1066".
[ERROR] Remote staging finished with a failure: java.net.SocketException: Connection timed out (Read failed)
The intended logic is to close the oldest driver pool when there are too many retired web drivers.
This issue is caused by a counting error: the count of retired web drivers should include the retired drivers in all driver pools.
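The intended counting can be sketched as follows. `Driver`, `DriverPool`, and `maxRetired` are hypothetical stand-ins for illustration and do not mirror Pulsar's real classes or configuration.

```kotlin
import java.time.Instant

// Hypothetical stand-ins, not Pulsar's real driver and pool types.
data class Driver(val retired: Boolean)
data class DriverPool(val createdAt: Instant, val drivers: List<Driver>)

// Count retired drivers across ALL pools, not just one of them.
fun countRetired(pools: Collection<DriverPool>): Int =
    pools.sumOf { pool -> pool.drivers.count { it.retired } }

// Pick the oldest pool for closing once the global count exceeds the limit.
fun poolToClose(pools: Collection<DriverPool>, maxRetired: Int): DriverPool? =
    if (countRetired(pools) > maxRetired) pools.minByOrNull { it.createdAt } else null
```

With two pools holding one and two retired drivers respectively, the global count is three, so a limit of two triggers closing the older pool even though neither pool exceeds the limit on its own.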
Exception in thread "DefaultDispatcher-worker-7" java.lang.NullPointerException
at ai.platon.pulsar.crawl.impl.StreamingCrawler.handleCanceled(StreamingCrawler.kt:649)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.handleRetry(StreamingCrawler.kt:541)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.runLoadTaskWithEventHandlers(StreamingCrawler.kt:462)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$runLoadTaskWithEventHandlers(StreamingCrawler.kt:68)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$runLoadTaskWithEventHandlers$1.invokeSuspend(StreamingCrawler.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Add a new method canConnect or equivalent to Pulsar Driver.
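What such a check would cover is not specified here. As one possible reading, the sketch below assumes it means "a TCP connection to a given host and port can be opened within a timeout"; the signature is hypothetical and not an existing Pulsar API.

```kotlin
import java.io.IOException
import java.net.InetSocketAddress
import java.net.Socket

// Hypothetical connectivity probe: true if a TCP connection to host:port
// can be established within timeoutMillis, false on any I/O failure.
fun canConnect(host: String, port: Int, timeoutMillis: Int = 1000): Boolean =
    try {
        Socket().use { socket ->
            socket.connect(InetSocketAddress(host, port), timeoutMillis)
            true
        }
    } catch (e: IOException) {
        false
    }
```

A driver could call such a probe against the browser's debugging port before issuing commands, and treat a `false` result as a signal to retire the connection.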
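I searched through the code and could not find an interface for setting the browser path, and System.setProperty() has no effect either. Currently, opening the browser automatically updates it to version 122; after downgrading to 104 there is no problem.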
The manual login method is:
Alternatively, copy the browser environment you use daily to the corresponding subdirectory under ~/.pulsar.
After copying, the directory ~/.pulsar/browser/chrome/prototype/google-chrome should contain the following files:
PS C:\Users\pereg\.pulsar\browser\chrome\prototype\google-chrome> ls
Directory: C:\Users\pereg\.pulsar\browser\chrome\prototype\google-chrome
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 2023/11/5 17:32 AutofillStates
d---- 2023/12/14 16:23 BrowserMetrics
d---- 2023/11/5 18:04 CertificateRevocation
d---- 2023/12/4 22:36 component_crx_cache
d---- 2023/10/27 9:43 Crashpad
d---- 2023/11/1 16:22 Crowd Deny
d---- 2023/12/14 16:25 Default
d---- 2023/11/1 16:38 extensions_crx_cache
d---- 2023/11/1 13:20 FileTypePolicies
d---- 2023/11/5 17:32 FirstPartySetsPreloaded
d---- 2023/10/27 9:43 GraphiteDawnCache
d---- 2023/11/5 18:04 GrShaderCache
d---- 2023/11/5 17:19 hyphen-data
d---- 2023/11/1 16:41 Local Traces
d---- 2023/10/27 9:43 MediaFoundationWidevineCdm
d---- 2023/10/27 9:43 MEIPreload
d---- 2023/10/27 9:43 OnDeviceHeadSuggestModel
d---- 2023/12/6 9:51 OptimizationGuidePredictionModels
d---- 2023/12/6 9:51 OptimizationHints
d---- 2023/11/1 13:24 OriginTrials
d---- 2023/12/6 9:51 PKIMetadata
d---- 2023/10/31 17:34 pnacl
d---- 2023/11/1 16:41 PnaclTranslationCache
d---- 2023/11/5 17:32 PrivacySandboxAttestationsPreloaded
d---- 2023/10/27 9:43 RecoveryImproved
d---- 2023/11/1 16:38 Safe Browsing
d---- 2023/12/6 9:51 SafetyTips
d---- 2023/11/5 16:52 segmentation_platform
d---- 2023/10/27 9:43 ShaderCache
d---- 2023/10/31 19:48 SSLErrorAssistant
d---- 2023/10/27 12:06 Subresource Filter
d---- 2023/11/1 13:21 ThirdPartyModuleList64
d---- 2023/11/5 16:52 TpcdMetadata
d---- 2023/10/31 18:55 TrustTokenKeyCommitments
d---- 2023/11/5 18:06 Webstore Downloads
d---- 2023/10/27 9:43 WidevineCdm
d---- 2023/11/1 16:22 ZxcvbnData
-a--- 2023/12/14 16:23 59 DevToolsActivePort
-a--- 2023/11/5 18:05 451968 en-US-10-1.bdic
-a--- 2023/11/1 16:37 0 First Run
-a--- 2023/11/5 18:08 57344 first_party_sets.db
-a--- 2023/11/5 18:08 0 first_party_sets.db-journal
-a--- 2023/12/4 22:33 106 Last Browser
-a--- 2023/12/14 16:23 13 Last Version
-a--- 2023/12/14 16:24 77401 Local State
-a--- 2023/12/14 16:23 87 Variations
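The copy step could also be scripted. The sketch below is an illustration only: `copyBrowserProfile` is a hypothetical helper, and the location of the source profile depends on the operating system and browser, so it is passed in rather than guessed. Chrome should be closed while copying.

```kotlin
import java.io.File

// Hypothetical helper: copies an existing browser user-data directory
// into the prototype directory. Returns true if the copy completed.
fun copyBrowserProfile(source: File, prototypeDir: File): Boolean {
    require(source.isDirectory) { "Source profile directory not found: $source" }
    return source.copyRecursively(prototypeDir, overwrite = true)
}
```

The target would be the prototype directory named above, for example `File(System.getProperty("user.home"), ".pulsar/browser/chrome/prototype/google-chrome")`.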
Originally posted by @galaxyeye in #51 (comment)