
platonai / pulsarrpa

Stars: 627 | Watchers: 19 | Forks: 104 | Size: 23.02 MB

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

License: GNU Affero General Public License v3.0

Java 44.47% Shell 0.30% Batchfile 0.02% Kotlin 54.05% PHP 0.02% Rich Text Format 0.04% JavaScript 0.95% TypeScript 0.01% PowerShell 0.15%
web-crawler web-mining data-science web-sql crawler scraper scraping web-scraping data-mining rpa

pulsarrpa's People

Contributors

dependabot[bot], galaxyeye, insidegalaxyeye, platonai

pulsarrpa's Issues

[main] INFO ai.platon.pulsar.crawl.component.LoadComponent.Task - 3. 💔 💿 U got 1600 0 <- 0 in , fc:1 ProtoNotFound(1600)

"C:\Program Files\Java\jdk-20\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2023.1.3\lib\idea_rt.jar=3907:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2023.1.3\bin" -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath C:\Users\Administrator\IdeaProjects\PulsarContexts\target\classes;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-jdk8\1.8.21\kotlin-stdlib-jdk8-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib\1.8.21\kotlin-stdlib-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-common\1.8.21\kotlin-stdlib-common-1.8.21.jar;C:\Users\Administrator.m2\repository\org\jetbrains\annotations\13.0\annotations-13.0.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-stdlib-jdk7\1.8.21\kotlin-stdlib-jdk7-1.8.21.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-skeleton\1.10.12\pulsar-skeleton-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-common\1.10.12\pulsar-common-1.10.12.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-core\5.3.17\spring-core-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-jcl\5.3.17\spring-jcl-5.3.17.jar;C:\Users\Administrator.m2\repository\xml-apis\xml-apis\1.3.04\xml-apis-1.3.04.jar;C:\Users\Administrator.m2\repository\org\apache\httpcomponents\httpclient\4.5.13\httpclient-4.5.13.jar;C:\Users\Administrator.m2\repository\org\apache\httpcomponents\httpcore\4.4.13\httpcore-4.4.13.jar;C:\Users\Administrator.m2\repository\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\Administrator.m2\repository\commons-codec\commons-codec\1.11\commons-codec-1.11.jar;C:\Users\Administrator.m2\repository\com\ibm\icu\icu4j\4.0.1\icu4j-4.0.1.jar;C:\Users\Administrator.m2\repository\commons-io\commons-io\2.11.0\commons-io-2.11.0.jar;C:\Users\Administrator.m2\repository\org\apac
he\commons\commons-lang3\3.12.0\commons-lang3-3.12.0.jar;C:\Users\Administrator.m2\repository\org\apache\commons\commons-math3\3.3\commons-math3-3.3.jar;C:\Users\Administrator.m2\repository\org\codehaus\woodstox\stax2-api\4.2.1\stax2-api-4.2.1.jar;C:\Users\Administrator.m2\repository\com\fasterxml\woodstox\woodstox-core\6.4.0\woodstox-core-6.4.0.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\module\jackson-module-kotlin\2.13.4\jackson-module-kotlin-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-databind\2.13.4\jackson-databind-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-annotations\2.13.4\jackson-annotations-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\dataformat\jackson-dataformat-properties\2.13.4\jackson-dataformat-properties-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\core\jackson-core\2.13.4\jackson-core-2.13.4.jar;C:\Users\Administrator.m2\repository\com\fasterxml\jackson\datatype\jackson-datatype-jsr310\2.13.4\jackson-datatype-jsr310-2.13.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-serialization\1.5.32\kotlin-serialization-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-gradle-plugin-api\1.5.32\kotlin-gradle-plugin-api-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-native-utils\1.5.32\kotlin-native-utils-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-util-io\1.5.32\kotlin-util-io-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-project-model\1.5.32\kotlin-project-model-1.5.32.jar;C:\Users\Administrator.m2\repository\org\nibor\autolink\autolink\0.10.0\autolink-0.10.0.jar;C:\Users\Administrator.m2\repository\ch\qos\logback\logback-classic\1.2.9\logback-classic-1.2.9.jar;C:\Users\Administrator.m2\repository\ch\qos\logback\logback-core\1.2.9\logback-core-1.2.9.jar;C:\Users\Administrator
.m2\repository\ai\platon\pulsar\pulsar-persist\1.10.12\pulsar-persist-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\gora-shaded-mongodb\0.8\gora-shaded-mongodb-0.8.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-jsoup\1.14.3\pulsar-jsoup-1.14.3.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro\1.8.1\avro-1.8.1.jar;C:\Users\Administrator.m2\repository\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\Administrator.m2\repository\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\Administrator.m2\repository\com\thoughtworks\paranamer\paranamer\2.7\paranamer-2.7.jar;C:\Users\Administrator.m2\repository\org\xerial\snappy\snappy-java\1.1.1.3\snappy-java-1.1.1.3.jar;C:\Users\Administrator.m2\repository\org\tukaani\xz\1.5\xz-1.5.jar;C:\Users\Administrator.m2\repository\org\apache\gora\gora-core\0.8\gora-core-0.8.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-frontend-jaxrs\2.5.2\cxf-rt-frontend-jaxrs-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-common-utilities\2.5.2\cxf-common-utilities-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\ws\xmlschema\xmlschema-core\2.0.1\xmlschema-core-2.0.1.jar;C:\Users\Administrator.m2\repository\org\codehaus\woodstox\woodstox-core-asl\4.1.1\woodstox-core-asl-4.1.1.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-api\2.5.2\cxf-api-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\neethi\neethi\3.0.1\neethi-3.0.1.jar;C:\Users\Administrator.m2\repository\wsdl4j\wsdl4j\1.6.2\wsdl4j-1.6.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-core\2.5.2\cxf-rt-core-2.5.2.jar;C:\Users\Administrator.m2\repository\com\sun\xml\bind\jaxb-impl\2.1.13\jaxb-impl-2.1.13.jar;C:\Users\Administrator.m2\repository\org\apache\geronimo\specs\geronimo-javamail_1.4_spec\1.7.1\geronimo-javamail_1.4_spec-1.7.1.jar;C:\Users\Administrator.m2\repository\javax\ws\rs\jsr311-api\1.1.
1\jsr311-api-1.1.1.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-bindings-xml\2.5.2\cxf-rt-bindings-xml-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-transports-http\2.5.2\cxf-rt-transports-http-2.5.2.jar;C:\Users\Administrator.m2\repository\org\apache\cxf\cxf-rt-transports-common\2.5.2\cxf-rt-transports-common-2.5.2.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-web\3.0.6.RELEASE\spring-web-3.0.6.RELEASE.jar;C:\Users\Administrator.m2\repository\aopalliance\aopalliance\1.0\aopalliance-1.0.jar;C:\Users\Administrator.m2\repository\org\codehaus\jettison\jettison\1.3.1\jettison-1.3.1.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-mapred\1.8.1\avro-mapred-1.8.1.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-ipc\1.8.1\avro-ipc-1.8.1.jar;C:\Users\Administrator.m2\repository\org\mortbay\jetty\jetty\6.1.26\jetty-6.1.26.jar;C:\Users\Administrator.m2\repository\org\mortbay\jetty\jetty-util\6.1.26\jetty-util-6.1.26.jar;C:\Users\Administrator.m2\repository\io\netty\netty\3.5.13.Final\netty-3.5.13.Final.jar;C:\Users\Administrator.m2\repository\commons-lang\commons-lang\2.6\commons-lang-2.6.jar;C:\Users\Administrator.m2\repository\org\apache\gora\gora-compiler\0.8\gora-compiler-0.8.jar;C:\Users\Administrator.m2\repository\org\apache\avro\avro-compiler\1.8.1\avro-compiler-1.8.1.jar;C:\Users\Administrator.m2\repository\org\apache\velocity\velocity\1.7\velocity-1.7.jar;C:\Users\Administrator.m2\repository\joda-time\joda-time\2.7\joda-time-2.7.jar;C:\Users\Administrator.m2\repository\org\jgrapht\jgrapht-core\1.0.0\jgrapht-core-1.0.0.jar;C:\Users\Administrator.m2\repository\org\jgrapht\jgrapht-ext\1.0.0\jgrapht-ext-1.0.0.jar;C:\Users\Administrator.m2\repository\org\tinyjee\jgraphx\jgraphx\2.0.0.1\jgraphx-2.0.0.1.jar;C:\Users\Administrator.m2\repository\jgraph\jgraph\5.13.0.0\jgraph-5.13.0.0.jar;C:\Users\Administrator.m2\repository\org\antlr\antlr4-runtime\4.5.3\antlr4-runtime-4.5.3.jar;C:\Users\A
dministrator.m2\repository\org\springframework\spring-context\5.3.17\spring-context-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-aop\5.3.17\spring-aop-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-beans\5.3.17\spring-beans-5.3.17.jar;C:\Users\Administrator.m2\repository\org\springframework\spring-expression\5.3.17\spring-expression-5.3.17.jar;C:\Users\Administrator.m2\repository\javax\xml\bind\jaxb-api\2.3.1\jaxb-api-2.3.1.jar;C:\Users\Administrator.m2\repository\javax\activation\javax.activation-api\1.2.0\javax.activation-api-1.2.0.jar;C:\Users\Administrator.m2\repository\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\Administrator.m2\repository\org\apache\hadoop\hadoop-common\2.7.2\hadoop-common-2.7.2.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-dom\1.10.12\pulsar-dom-1.10.12.jar;C:\Users\Administrator.m2\repository\com\udojava\EvalEx\2.0\EvalEx-2.0.jar;C:\Users\Administrator.m2\repository\org\perf4j\perf4j\0.9.16\perf4j-0.9.16.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-browser\1.10.12\pulsar-browser-1.10.12.jar;C:\Users\Administrator.m2\repository\io\dropwizard\metrics\metrics-core\4.1.29\metrics-core-4.1.29.jar;C:\Users\Administrator.m2\repository\javax\websocket\javax.websocket-api\1.1\javax.websocket-api-1.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-container-grizzly-client\1.13.1\tyrus-container-grizzly-client-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-framework\2.3.25\grizzly-framework-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-http-server\2.3.25\grizzly-http-server-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\grizzly\grizzly-http\2.3.25\grizzly-http-2.3.25.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-client\1.13.1\tyrus-client-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\t
yrus\tyrus-core\1.13.1\tyrus-core-1.13.1.jar;C:\Users\Administrator.m2\repository\org\glassfish\tyrus\tyrus-spi\1.13.1\tyrus-spi-1.13.1.jar;C:\Users\Administrator.m2\repository\com\github\kklisura\cdt\cdt-java-client\4.0.0\cdt-java-client-4.0.0.jar;C:\Users\Administrator.m2\repository\org\javassist\javassist\3.24.1-GA\javassist-3.24.1-GA.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-ql-common\1.10.12\pulsar-ql-common-1.10.12.jar;C:\Users\Administrator.m2\repository\ai\platon\pulsar\pulsar-h2\1.4.196\pulsar-h2-1.4.196.jar;C:\Users\Administrator.m2\repository\org\apache\commons\commons-collections4\4.4\commons-collections4-4.4.jar;C:\Users\Administrator.m2\repository\com\google\code\crawler-commons\crawler-commons\0.5\crawler-commons-0.5.jar;C:\Users\Administrator.m2\repository\org\apache\tika\tika-core\1.6\tika-core-1.6.jar;C:\Users\Administrator.m2\repository\org\slf4j\slf4j-api\1.7.7\slf4j-api-1.7.7.jar;C:\Users\Administrator.m2\repository\com\google\guava\guava\30.1-jre\guava-30.1-jre.jar;C:\Users\Administrator.m2\repository\com\google\guava\failureaccess\1.0.1\failureaccess-1.0.1.jar;C:\Users\Administrator.m2\repository\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\Users\Administrator.m2\repository\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;C:\Users\Administrator.m2\repository\org\checkerframework\checker-qual\3.5.0\checker-qual-3.5.0.jar;C:\Users\Administrator.m2\repository\com\google\errorprone\error_prone_annotations\2.3.4\error_prone_annotations-2.3.4.jar;C:\Users\Administrator.m2\repository\com\google\j2objc\j2objc-annotations\1.3\j2objc-annotations-1.3.jar;C:\Users\Administrator.m2\repository\com\google\code\gson\gson\2.10.1\gson-2.10.1.jar;C:\Users\Administrator.m2\repository\oro\oro\2.0.8\oro-2.0.8.jar;C:\Users\Administrator.m2\repository\com\beust\jcommander\1.81\jcommander-1.81.jar;C:\Users\Administrator.m2\repository\com\github
\oshi\oshi-core\5.6.1\oshi-core-5.6.1.jar;C:\Users\Administrator.m2\repository\net\java\dev\jna\jna\5.8.0\jna-5.8.0.jar;C:\Users\Administrator.m2\repository\net\java\dev\jna\jna-platform\5.8.0\jna-platform-5.8.0.jar;C:\Users\Administrator.m2\repository\io\dropwizard\metrics\metrics-graphite\4.1.29\metrics-graphite-4.1.29.jar;C:\Users\Administrator.m2\repository\com\rabbitmq\amqp-client\5.14.0\amqp-client-5.14.0.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-jdk8\1.6.4\kotlinx-coroutines-jdk8-1.6.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-core-jvm\1.6.4\kotlinx-coroutines-core-jvm-1.6.4.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlin\kotlin-reflect\1.5.32\kotlin-reflect-1.5.32.jar;C:\Users\Administrator.m2\repository\org\jetbrains\kotlinx\kotlinx-coroutines-core\1.6.4\kotlinx-coroutines-core-1.6.4.jar ai.platon.pulsar.examples.sites.topEc.english.amazon.MainKt
16:14:22.745 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Find legacy resource: jar:file:/C:/Users/Administrator/.m2/repository/ai/platon/pulsar/pulsar-skeleton/1.10.12/pulsar-skeleton-1.10.12.jar!/config/legacy/pulsar-default.xml
16:14:22.748 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Find legacy resource: jar:file:/C:/Users/Administrator/.m2/repository/ai/platon/pulsar/pulsar-skeleton/1.10.12/pulsar-skeleton-1.10.12.jar!/config/legacy/pulsar-site.xml
16:14:22.749 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - Resource not find: pulsar-task.xml
16:14:22.774 [main] INFO ai.platon.pulsar.common.config.AbstractConfiguration - profile: <> | [pulsar-default.xml, pulsar-site.xml]
16:14:22.792 [main] INFO ai.platon.pulsar.crawl.protocol.ProtocolFactory - Supported protocols:
16:14:22.812 [main] INFO ai.platon.pulsar.crawl.parse.html.PrimerHtmlParser - className: PrimerHtmlParser defaultCharEncoding: utf-8
16:14:22.879 [main] INFO ai.platon.pulsar.crawl.parse.PageParser - maxParseTime: PT1M maxParsedLinks: 200 groupMode: BY_HOST ignoreExternalLinks: false maxUrlLength: 1024
16:14:22.904 [main] INFO ai.platon.pulsar.crawl.impl.StreamingCrawlLoop - Crawl loop is created | @977552154
16:14:22.906 [main] DEBUG org.springframework.context.support.StaticApplicationContext - Refreshing org.springframework.context.support.StaticApplicationContext@58651fd0
16:14:22.953 [main] INFO ai.platon.pulsar.context.PulsarContexts - Active context | ai.platon.pulsar.context.support.StaticPulsarContext#1
16:14:23.985 [main] INFO ai.platon.pulsar.persist.gora.GoraStorage - Backend data store: FileBackendPageStore realSchema: FileBackendPageStore
16:14:24.112 [main] INFO ai.platon.pulsar.persist.AutoDetectStorageProvider - Storage is created: class ai.platon.pulsar.persist.gora.FileBackendPageStore realSchema: FileBackendPageStore
16:14:24.188 [main] INFO ai.platon.pulsar.crawl.component.LoadComponent.Task - 3. 💔 💿 U got 1600 0 <- 0 in , fc:1 ProtoNotFound(1600) | https://www.amazon.com/Best-Sellers/zgbs -outLinkSelector a[href~=/dp/]
16:14:24.188 [main] INFO ai.platon.pulsar.crawl.component.LoadComponent.Task - Log explanation: https://github.com/platonai/pulsarr/blob/master/docs/log-format.adoc
16:14:24.307 [main] INFO ai.platon.pulsar.crawl.impl.StreamingCrawlLoop - Registered 15 link collectors | loop#1 @977552154
[]
16:14:24.330 [SpringContextShutdownHook] DEBUG org.springframework.context.support.StaticApplicationContext - Closing org.springframework.context.support.StaticApplicationContext@58651fd0, started on Sun Jun 25 16:14:22 CST 2023
16:14:24.330 [Thread-0] INFO ai.platon.pulsar.context.support.AbstractPulsarContext - Closing context #1/2 | StaticPulsarContext
16:14:24.331 [Thread-0] INFO ai.platon.pulsar.session.AbstractPulsarSession - Session is closed | #1000002
16:14:24.331 [Thread-0] INFO ai.platon.pulsar.session.AbstractPulsarSession - Session is closed | #1000001
16:14:24.331 [DefaultDispatcher-worker-1] INFO ai.platon.pulsar.crawl.impl.StreamingCrawler - Starting StreamingCrawler #1 ...

Process finished with exit code 0

The log reports that the protocol was not found. What is the likely cause?

How to modify the browser's default launch arguments

"C:\Program Files\Google\Chrome\Application\chrome.exe" --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true

Failed to create symbolic link on Windows

2022-10-15 14:44:12.146 WARN [-worker-12] a.p.p.p.b.e.i.BrowserEmulatorImplBase - java.nio.file.FileSystemException: C:\Users\VINCEN~1\AppData\Local\Temp\ln\5a6caaaaa8aaf6e230182a2bbad7c43c.htm: 客户端没有所需的特权。 [A required privilege is not held by the client.]

Environment:

OS: Windows 11
JDK: Java 11
Command line: "C:\Program Files\Java\jdk-11.0.2\bin\java.exe" "-javaagent:D:\Program Files\JetBrains\IntelliJ IDEA 2022.1.3\lib\idea_rt.jar=61275:D:\Program Files\JetBrains\IntelliJ IDEA 2022.1.3\bin" -Dfile.encoding=UTF-8 -classpath "..." ai.platon.exotic.examples.sites.walmart.WalmartCrawlerKt
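
The exception above is what Windows raises when the process is not allowed to create symbolic links. One possible workaround, sketched here under assumptions, is to fall back to a plain copy when linking is refused. The class and method names (LinkOrCopy, linkOrCopy) are hypothetical and are not part of PulsarRPA; only the java.nio.file calls are real.

```java
import java.io.IOException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not PulsarRPA code.
final class LinkOrCopy {
    /**
     * Tries to create a symbolic link at `link` pointing to `target`.
     * If the OS refuses (for example, Windows without the symlink privilege),
     * the target file is copied to `link` instead.
     *
     * @return true if a link was created, false if the copy fallback was used
     */
    static boolean linkOrCopy(Path link, Path target) throws IOException {
        Path parent = link.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.deleteIfExists(link);
        try {
            Files.createSymbolicLink(link, target);
            return true;
        } catch (FileSystemException | UnsupportedOperationException | SecurityException e) {
            // Linking is not permitted or not supported here: copy the file instead.
            Files.copy(target, link, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

Either way, the path at `link` ends up readable with the same content; the return value only tells the caller which strategy was used.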

Failed to copy the Chrome data dir when it contains a SingletonSocket symbolic link

21:19:13.336 [main] INFO a.p.p.b.driver.chrome.ChromeLauncher - User data dir does not exist, copy from prototype | /tmp/pulsar-vincent/context/cx.1iAggh21/pulsar_chrome <- /home/vincent/.pulsar/browser/chrome/prototype/google-chrome
21:19:18.060 [main] WARN a.p.p.b.driver.chrome.ChromeLauncher - Failed to prepare user data dir
java.lang.IllegalArgumentException: Parameter 'srcFile' is not a file: /home/vincent/.pulsar/browser/chrome/prototype/google-chrome/SingletonSocket
at org.apache.commons.io.FileUtils.requireFile(FileUtils.java:2737)
at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:841)
at org.apache.commons.io.FileUtils.doCopyDirectory(FileUtils.java:1312)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:699)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:630)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:531)
at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:502)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.prepareUserDataDir(ChromeLauncher.kt:236)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.launch(ChromeLauncher.kt:50)
at ai.platon.pulsar.browser.driver.chrome.ChromeLauncher.launch(ChromeLauncher.kt:61)
at ai.platon.pulsar.protocol.browser.driver.BrowserFactory.launchChromeDevtoolsBrowser(BrowserFactory.kt:40)
at ai.platon.pulsar.protocol.browser.driver.BrowserFactory.launch(BrowserFactory.kt:22)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launchIfAbsent$lambda-12(BrowserManager.kt:106)
at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1705)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launchIfAbsent(BrowserManager.kt:105)
at ai.platon.pulsar.protocol.browser.driver.BrowserManager.launch(BrowserManager.kt:38)
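
The trace shows the commons-io directory copy failing on SingletonSocket because that entry is not a regular file. A minimal sketch of one way around this, assuming it is acceptable to skip such entries when copying the prototype directory, is to walk the tree and copy only regular files. PrototypeDirCopier and copyRegularFiles are hypothetical names, not the project's actual fix.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not PulsarRPA code.
final class PrototypeDirCopier {
    /** Copies src into dst, skipping symbolic links, sockets and other non-regular entries. */
    static void copyRegularFiles(Path src, Path dst) throws IOException {
        Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                // Recreate the directory structure under dst.
                Files.createDirectories(dst.resolve(src.relativize(dir).toString()));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                // walkFileTree does not follow links by default, so attrs describe the link itself.
                if (attrs.isRegularFile()) {
                    Files.copy(file, dst.resolve(src.relativize(file).toString()),
                            StandardCopyOption.REPLACE_EXISTING);
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

Skipping Chrome's Singleton* entries should be harmless for a fresh user data dir, since Chrome recreates them on launch, but that is an assumption worth verifying against the Chrome version in use.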

BrowserEmulator - Unexpected exception

2022-05-22 16:49:04.067 WARN [r-worker-5] a.p.p.p.b.e.BrowserEmulator - Unexpected exception

java.lang.StringIndexOutOfBoundsException: String index out of range: 15
at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:48)
at java.base/java.lang.String.charAt(String.java:711)
at ai.platon.pulsar.common.HtmlUtils.isBlankBody(Htmls.kt:107)
at ai.platon.pulsar.protocol.browser.emulator.EmulateEventHandler.checkHtmlIntegrity(EmulateEventHandler.kt:137)
at ai.platon.pulsar.protocol.browser.emulator.EmulateEventHandler.onAfterNavigate(EmulateEventHandler.kt:89)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator.browseWithMinorExceptionsHandled(BrowserEmulator.kt:180)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator.access$browseWithMinorExceptionsHandled(BrowserEmulator.kt:34)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulator$browseWithMinorExceptionsHandled$1.invokeSuspend(BrowserEmulator.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
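
The trace points at a charAt call past the end of the string inside isBlankBody, which suggests index arithmetic on a short or truncated document. The actual implementation is not shown in this issue, so the following is only a sketch of a bounds-safe alternative. BlankBodyCheck is a hypothetical class, and treating HTML without a complete body element as blank is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative, not the code from Htmls.kt.
final class BlankBodyCheck {
    // Captures everything between the first <body ...> tag and the next </body>.
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns true when the HTML has no body element or the body contains only whitespace. */
    static boolean isBlankBody(String html) {
        if (html == null) {
            return true;
        }
        Matcher m = BODY.matcher(html);
        // Assumption: a missing or unterminated body (for example, truncated HTML) counts as blank.
        return !m.find() || m.group(1).trim().isEmpty();
    }
}
```

Because the matcher never indexes into the string by hand, a 15-character fragment such as the one implied by the exception simply fails to match instead of throwing.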

Chrome 111 support.

It doesn't work with Chrome v111.

First reported at platonai/exotic-amazon#16 .

Our users reported the incompatibility. We diagnosed the problem and found that Chrome 111's browser_protocol.json and js_protocol.json have changed a lot.

Solution:

  1. Change the HTTP method to PUT when creating, activating, or closing a tab.
  2. Add the Chrome launch parameter: --remote-allow-origins=*
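
The two steps above can be sketched with the JDK's own HTTP client types. This is an illustration under assumptions, not the project's implementation: Chrome111Compat and its methods are hypothetical names, and the debugging port is whatever Chrome was started with. The /json/new endpoint is the DevTools HTTP endpoint for opening a tab.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

// Hypothetical helper, not PulsarRPA code.
final class Chrome111Compat {
    /** Builds a launch command that includes the origin flag named in step 2. */
    static List<String> launchArgs(String chromePath, int debugPort) {
        return List.of(
                chromePath,
                "--remote-debugging-port=" + debugPort,
                "--remote-allow-origins=*");
    }

    /**
     * Builds the "open a new tab" request with the PUT method, as described in step 1.
     * The page URL is appended as the raw query string and must already be a valid URI.
     */
    static HttpRequest newTabRequest(int debugPort, String pageUrl) {
        URI endpoint = URI.create("http://localhost:" + debugPort + "/json/new?" + pageUrl);
        return HttpRequest.newBuilder(endpoint)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

Once Chrome is running with these arguments, the request can be sent with java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). The wildcard origin is convenient for local debugging; a narrower origin is likely preferable elsewhere.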

TypeError: document.body.HMNvqKforEach is not a function

When running exotic-standalone on WSL, we see:

2022-05-20 15:51:31.107 INFO [r-worker-4] a.p.p.p.b.d.c.ChromeDevtoolsDriver - TypeError: document.body.HMNvqKforEach is not a function
at Function.HMNvqKutils__.updatePulsarStat (:300:23)
at Function.HMNvqKutils__.isActuallyReady (:237:19)
at Function.HMNvqKutils__.checkPulsarStatus (:178:31)
at Function.HMNvqKutils__.waitForReady (:151:26)
at :1:15

We think it is caused by the JS resources being loaded twice, so that the scriptNameCipher might be calculated twice for some reason:

2022-05-20 15:48:05.973 INFO [r-worker-1] a.p.p.c.ResourceLoader - Find resource js/pulsar_utils.js | jar:file:/home/vincent/workspace/exotic-standalone.jar!/BOOT-INF/lib/pulsar-browser-1.9.6.jar!/js/pulsar_utils.js
2022-05-20 15:48:05.973 INFO [r-worker-2] a.p.p.c.ResourceLoader - Find resource js/pulsar_utils.js | jar:file:/home/vincent/workspace/exotic-standalone.jar!/BOOT-INF/lib/pulsar-browser-1.9.6.jar!/js/pulsar_utils.js

I suggest making sure that the JS resources are loaded only once and that the scriptNameCipher is unique within the process.
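
A minimal sketch of that suggestion, assuming the scripts are prepared by replacing a placeholder prefix with the cipher: the cipher lives in a lazily initialized holder class (created once per JVM), and prepared scripts are cached with computeIfAbsent so concurrent workers trigger a single load. ScriptRegistry, the "__prefix__" placeholder and the reader function are all hypothetical; the real resource loader is not shown in this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch, not PulsarRPA code.
final class ScriptRegistry {
    // Hypothetical placeholder that the cipher replaces inside the script source.
    static final String PLACEHOLDER = "__prefix__";

    // Initialization-on-demand holder: the JVM runs this initializer at most once per process.
    private static final class CipherHolder {
        static final String CIPHER = randomCipher();
    }

    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();
    private final Function<String, String> reader;

    /** @param reader reads the raw script text for a resource name */
    ScriptRegistry(Function<String, String> reader) {
        this.reader = reader;
    }

    static String scriptNameCipher() {
        return CipherHolder.CIPHER;
    }

    /** Returns the prepared script, reading and rewriting it only on the first request. */
    String load(String resourceName) {
        return cache.computeIfAbsent(resourceName, name -> {
            loads.incrementAndGet();
            return reader.apply(name).replace(PLACEHOLDER, scriptNameCipher());
        });
    }

    /** Number of times a resource was actually read. */
    int loadCount() {
        return loads.get();
    }

    private static String randomCipher() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 6; i++) {
            sb.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return sb.toString();
    }
}
```

With this shape, two workers asking for js/pulsar_utils.js at the same instant, as in the log lines above, would share one prepared copy and one cipher.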

Reading from FileBackendPageStore failed.

Reading from FileBackendPageStore failed.

Exception in thread "main" java.nio.file.FileSystemException: C:\Users\Vincent Zhang.pulsar\data\store\nbzfcg-cn\nbzfcg-cn-5fb8f1e5b8322a31bb42dbdcee9d256f.avro: 另一个程序正在使用此文件,进程无法访问。 [The process cannot access the file because it is being used by another process.]
at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
at java.base/sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:274)
at java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110)
at java.base/java.nio.file.Files.deleteIfExists(Files.java:1185)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.readAvro(FileBackendPageStore.kt:98)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:41)
at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:30)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
at ai.platon.pulsar.persist.WebDb.getOrNull(WebDb.kt:71)
at ai.platon.pulsar.persist.WebDb.getOrNull$default(WebDb.kt:65)
at ai.platon.pulsar.crawl.component.LoadComponent.createPageShell(LoadComponent.kt:259)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred0(LoadComponent.kt:206)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetryDeferred(LoadComponent.kt:116)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred(LoadComponent.kt:95)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred$suspendImpl(AbstractPulsarContext.kt:329)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred(AbstractPulsarContext.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCacheDeferred(AbstractPulsarSession.kt:487)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:192)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.scent.dm.HarvestRunner.loadDeferred(HarvestRunner.kt:251)
at ai.platon.scent.dm.HarvestRunner.access$loadDeferred(HarvestRunner.kt:40)
at ai.platon.scent.dm.HarvestRunner$loadDocumentsDeferred$2$1$1.invokeSuspend(HarvestRunner.kt:275)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

PS C:\Users\Vincent Zhang.pulsar\proxy> java -version
openjdk version "14.0.2" 2020-07-14
OpenJDK Runtime Environment (build 14.0.2+12-46)
OpenJDK 64-Bit Server VM (build 14.0.2+12-46, mixed mode, sharing)

PS C:\Users\Vincent Zhang.pulsar\proxy> Get-ComputerInfo -Property "os*" | select OSName, OsArchitecture

OsName                                        OsArchitecture
------                                        --------------
Microsoft Windows 11 Home (Chinese edition)   64-bit
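
The trace shows Files.deleteIfExists failing because another process (or another thread of the same process) still holds the .avro file open, which Windows does not allow. A hedged sketch of one mitigation is to retry the delete a few times with a short, growing delay. RetryingDelete and its parameters are hypothetical; this does not address whatever is holding the file open.

```java
import java.io.IOException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not PulsarRPA code.
final class RetryingDelete {
    /**
     * Deletes the file if it exists, retrying when the file system reports a failure
     * such as "file is in use". Rethrows the last failure once attempts are exhausted.
     *
     * @return true if the file was deleted, false if it did not exist
     */
    static boolean deleteIfExistsWithRetry(Path path, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        FileSystemException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Files.deleteIfExists(path);
            } catch (FileSystemException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff: wait a little longer after each failed attempt.
                    Thread.sleep(delayMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

A call such as deleteIfExistsWithRetry(path, 5, 100) waits at most about one second in total before giving up.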

InaccessibleObjectException: Unable to make field private final long java.time.Duration.seconds accessible

bin/build-run.sh
// OK
2023-02-16 22:12:09.736 INFO [main] a.p.p.a.m.PulsarMasterKt - Starting PulsarMasterKt v1.10.10-SNAPSHOT using Java 17.0.5 on regulus with PID 21576 (/home/vincent/workspace/pulsar-1.10.x/pulsar-app/pulsar-master/target/pulsar-master-1.10.10-SNAPSHOT.jar started by vincent in /home/vincent/workspace/pulsar-1.10.x)

And then we issue an X-SQL to scrape:

bin/scrape.sh
// The server issues the warning message

...
...
...

2023-02-16 22:14:14.929 WARN [r-worker-2] a.p.p.c.i.StreamingCrawler - [Unexpected]

java.lang.reflect.InaccessibleObjectException: Unable to make field private final long java.time.Duration.seconds accessible: module java.base does not "opens java.time" to unnamed module @62bd2070
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at com.google.gson.internal.reflect.UnsafeReflectionAccessor.makeAccessible(UnsafeReflectionAccessor.java:44)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.getBoundFields(ReflectiveTypeAdapterFactory.java:159)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.create(ReflectiveTypeAdapterFactory.java:102)
at com.google.gson.Gson.getAdapter(Gson.java:489)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.createBoundField(ReflectiveTypeAdapterFactory.java:117)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.getBoundFields(ReflectiveTypeAdapterFactory.java:166)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory.create(ReflectiveTypeAdapterFactory.java:102)
at com.google.gson.Gson.getAdapter(Gson.java:489)
at com.google.gson.Gson.toJson(Gson.java:727)
at com.google.gson.Gson.toJson(Gson.java:714)
at com.google.gson.Gson.toJson(Gson.java:669)
at com.google.gson.Gson.toJson(Gson.java:649)
at ai.platon.pulsar.browser.common.InteractSettings.overrideConfiguration(BrowserSettings.kt:391)
at ai.platon.pulsar.common.options.LoadOptions.overrideConfiguration(LoadOptions.kt:691)
at ai.platon.pulsar.common.options.LoadOptions.overrideConfiguration(LoadOptions.kt:671)
at ai.platon.pulsar.common.urls.CombinedUrlNormalizer.createLoadOptions(CombinedUrlNormalizer.kt:49)
at ai.platon.pulsar.common.urls.CombinedUrlNormalizer.normalize(CombinedUrlNormalizer.kt:21)
at ai.platon.pulsar.context.support.AbstractPulsarContext.normalize(AbstractPulsarContext.kt:209)
at ai.platon.pulsar.session.AbstractPulsarSession.normalize(AbstractPulsarSession.kt:132)
at ai.platon.pulsar.session.PulsarSession$DefaultImpls.normalize$default(PulsarSession.kt:187)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:185)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.loadWithMinorExceptionHandled(StreamingCrawler.kt:475)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$loadWithMinorExceptionHandled(StreamingCrawler.kt:66)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invokeSuspend(StreamingCrawler.kt:395)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invoke(StreamingCrawler.kt)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$loadWithTimeout$2.invoke(StreamingCrawler.kt)
at kotlinx.coroutines.intrinsics.UndispatchedKt.startUndispatchedOrReturnIgnoreTimeout(Undispatched.kt:100)
at kotlinx.coroutines.TimeoutKt.setupTimeout(Timeout.kt:146)
at kotlinx.coroutines.TimeoutKt.withTimeout(Timeout.kt:44)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.loadWithTimeout(StreamingCrawler.kt:394)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.runLoadTaskWithEventHandlers(StreamingCrawler.kt:376)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$runLoadTaskWithEventHandlers(StreamingCrawler.kt:66)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$runWithStatusCheck$2.invokeSuspend(StreamingCrawler.kt:354)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)

OutOfMemoryError: Java heap space from BrowserEmulatorImplBase.createResponse

15:26:41.436 [-worker-15] WARN a.p.p.p.b.e.i.BrowserEmulatedFetcherImpl - [Unexpected] Failed to visit page | https://www.google.de/search?q=Favorite+World%2C+LLC+Telefon
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringUTF16.compress(StringUTF16.java:168)
at java.base/java.lang.StringUTF16.newString(StringUTF16.java:1019)
at java.base/java.lang.StringBuilder.toString(StringBuilder.java:453)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatorImplBase.createResponse(BrowserEmulatorImplBase.kt:89)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.browseWithWebDriver(InteractiveBrowserEmulator.kt:325)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.access$browseWithWebDriver(InteractiveBrowserEmulator.kt:37)
at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator$browseWithWebDriver$1.invokeSuspend(InteractiveBrowserEmulator.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

Java OutOfMemoryError

I use ScentSQLContext to call executeQuery, iterating over a list of URLs with the SQL function load_and_select(@url,'css').
The Java heap keeps growing and never shrinks, and about two hours later an OutOfMemoryError is thrown.

I do call rs.close(), but the heap still keeps growing.
The heap dump shows that NodeList holds many byte[] instances (screenshot not included).

How can I resolve this issue?

Illegal reflective access by javassist.util.proxy.SecurityActions

The following warnings appear every time we run the PulsarRPA examples, demos, or services:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by javassist.util.proxy.SecurityActions (file:/C:/Users/pereg/.m2/repository/javassist/javassist/3.12.1.GA/javassist-3.12.1.GA.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of javassist.util.proxy.SecurityActions
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Under Windows, OSHI and Kotlin are not compatible

The com.sun.jna package comes with Kotlin, but its com.sun.jna.Memory class has no close() method.

2023-09-11 21:47:40.489 INFO [main] a.p.p.p.b.e.c.BasicPrivacyContextManager - Privacy context is created #091119IXKO1
java.lang.NoSuchMethodError: com.sun.jna.Memory.close()V
at oshi.util.Util.freeMemory(Util.java:83)
at oshi.jna.ByRef$CloseableHANDLEByReference.close(ByRef.java:95)
at oshi.software.os.windows.WindowsOperatingSystem.enableDebugPrivilege(WindowsOperatingSystem.java:469)
at oshi.software.os.windows.WindowsOperatingSystem.(WindowsOperatingSystem.java:105)
at oshi.SystemInfo.createOperatingSystem(SystemInfo.java:82)
at oshi.util.Memoizer$1.get(Memoizer.java:61)
at oshi.SystemInfo.getOperatingSystem(SystemInfo.java:76)
at ai.platon.pulsar.common.AppSystemInfo$Companion.isOSHIAvailable(AppSystemInfo.kt:132)
at ai.platon.pulsar.common.AppSystemInfo.(AppSystemInfo.kt:30)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.shouldCreateWebDriver(LoadingWebDriverPool.kt:370)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.resourceSafeCreateDriverIfNecessary(LoadingWebDriverPool.kt:334)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.pollWebDriver(LoadingWebDriverPool.kt:314)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:202)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:197)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.pollWithEvents(LoadingWebDriverPool.kt:304)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:224)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.runWithDriverPool(WebDriverPoolManager.kt:493)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.access$runWithDriverPool(WebDriverPoolManager.kt:32)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invokeSuspend(WebDriverPoolManager.kt:461)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$runWithDriverPool$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.common.PreemptChannelSupport.whenNormalDeferred(PreemptChannelSupport.kt:58)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.runWithDriverPool(WebDriverPoolManager.kt:449)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.doRun(WebDriverPoolManager.kt:398)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:157)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:137)
at ai.platon.pulsar.protocol.browser.emulator.context.WebDriverContext.run(WebDriverContext.kt:77)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun$suspendImpl(BrowserPrivacyContext.kt:69)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun(BrowserPrivacyContext.kt)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run$suspendImpl(PrivacyContext.kt:287)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run(PrivacyContext.kt)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run1(BasicPrivacyContextManager.kt:92)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run0(BasicPrivacyContextManager.kt:80)
at ai.platon.pulsar.protocol.browser.emulator.context.BasicPrivacyContextManager.run(BasicPrivacyContextManager.kt:34)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchTaskDeferred(BrowserEmulatedFetcherImpl.kt:93)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContentDeferred$suspendImpl(BrowserEmulatedFetcherImpl.kt:80)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContentDeferred(BrowserEmulatedFetcherImpl.kt)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl$fetchContent$1.invokeSuspend(BrowserEmulatedFetcherImpl.kt:57)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at ai.platon.pulsar.protocol.browser.emulator.impl.BrowserEmulatedFetcherImpl.fetchContent(BrowserEmulatedFetcherImpl.kt:56)
at ai.platon.pulsar.protocol.browser.BrowserEmulatorProtocol.getResponse(BrowserEmulatorProtocol.kt:43)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputWithRetry(AbstractHttpProtocol.kt:118)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutput(AbstractHttpProtocol.kt:88)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContent0(FetchComponent.kt:108)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContent(FetchComponent.kt:75)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContent(LoadComponent.kt:505)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentIfNecessary(LoadComponent.kt:266)
at ai.platon.pulsar.crawl.component.LoadComponent.load1(LoadComponent.kt:233)
at ai.platon.pulsar.crawl.component.LoadComponent.load0(LoadComponent.kt:227)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetry(LoadComponent.kt:129)
at ai.platon.pulsar.crawl.component.LoadComponent.load(LoadComponent.kt:117)
at ai.platon.pulsar.context.support.AbstractPulsarContext.load(AbstractPulsarContext.kt:367)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCache(AbstractPulsarSession.kt:493)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:184)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:171)
at ai.platon.pulsar.session.AbstractPulsarSession.load(AbstractPulsarSession.kt:169)
at ai.platon.pulsar.examples._0_BasicUsageKt.main(_0_BasicUsage.kt:17)
at ai.platon.pulsar.examples._0_BasicUsageKt.main(_0_BasicUsage.kt)

MongoSocketReadException: Prematurely reached end of stream

MongoDB is already closed before MiscMessageWriter.close, in which WebDb.flush is called. This happens when the embedded MongoDB is started in Exotic.

A possible solution is to remove the WebDb dependency from MiscMessageWriter.

2022-05-29 20:14:20.643 ERROR [utdownHook] a.p.p.p.WebDb - ai.platon.shaded.com.mongodb.MongoSocketReadException: Prematurely reached end of stream
at ai.platon.shaded.com.mongodb.internal.connection.SocketStream.read(SocketStream.java:112)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:579)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:444)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:298)
at ai.platon.shaded.com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:258)
at ai.platon.shaded.com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
at ai.platon.shaded.com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
at ai.platon.shaded.com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:343)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:334)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:220)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper$5.call(CommandOperationHelper.java:206)
at ai.platon.shaded.com.mongodb.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:463)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:203)
at ai.platon.shaded.com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:198)
at ai.platon.shaded.com.mongodb.operation.CommandReadOperation.execute(CommandReadOperation.java:59)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:194)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:175)
at ai.platon.shaded.com.mongodb.DB.executeCommand(DB.java:775)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:521)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:537)
at ai.platon.shaded.com.mongodb.DB.command(DB.java:492)
at ai.platon.shaded.com.mongodb.Mongo.fsync(Mongo.java:648)
at org.apache.gora.mongodb.store.MongoStore.flush(MongoStore.java:294)
at ai.platon.pulsar.persist.WebDb.flush(WebDb.kt:261)
at ai.platon.pulsar.common.message.MiscMessageWriter.commit(MiscMessageWriter.kt:305)
at ai.platon.pulsar.common.message.MiscMessageWriter.close(MiscMessageWriter.kt:310)
at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:239)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1161)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1154)
at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1106)
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1075)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:172)
at org.springframework.context.support.AbstractApplicationContext$1.run(AbstractApplicationContext.java:991)

The selectHyperlinks family of methods does not support all <a> tag selections

The following queries fail:

document.selectHyperlinks('[href=/dp/]')
ele.selectHyperlinks('[href=/dp/]')

The following queries are supported by Chrome DevTools, but I am not sure whether they are standard; they also fail:

document.selectHyperlinks('[href*=/dp/]')
ele.selectHyperlinks('[href*=/dp/]')

Handle non-standard css selectors

Some websites use selectors that do not conform to the standard. For example,

<div class='KAHaP+'></div>

The character "+" is not allowed in a class name, so Jsoup throws a SelectorParseException and pulsar-dom throws a PowerSelectorParseException.

We found the issue when handling jd.com and shopee.sg.

Jsoup follows the CSS2 value definition standard:
https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier

In CSS, identifiers (including element names, classes, and IDs in [selectors](https://www.w3.org/TR/CSS2/selector.html)) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

For more about valid characters in a CSS selector:
https://pineco.de/css-quick-tip-the-valid-characters-in-a-custom-css-selector/
A valid identifier will look something like this:
-?[_a-zA-Z]+[_a-zA-Z0-9-]*
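
One way to cope with such class names is to escape them before building the selector, as the CSS2 text quoted above allows ("B&W?" may be written as "B\&W\?"). The following is a minimal sketch, loosely modeled on the rules of the browser-side CSS.escape() function; the class name CssIdentifiers is hypothetical and is not part of Jsoup or pulsar-dom, and whether a given Jsoup or pulsar-dom version accepts escaped identifiers is an assumption that should be checked.

```java
public final class CssIdentifiers {
    private CssIdentifiers() {}

    /** Escapes a raw name (for example a class attribute value) for use as a CSS identifier. */
    public static String escape(String ident) {
        StringBuilder out = new StringBuilder(ident.length() + 8);
        int i = 0;
        while (i < ident.length()) {
            int cp = ident.codePointAt(i);
            boolean digit = cp >= '0' && cp <= '9';
            boolean letter = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
            // A digit may not start an identifier, nor follow a single leading hyphen.
            boolean leadingDigit = digit && (i == 0 || (i == 1 && ident.charAt(0) == '-'));
            if (cp == 0) {
                out.append('\uFFFD'); // NULL is replaced, not escaped
            } else if (cp <= 0x1F || cp == 0x7F || leadingDigit) {
                // Numeric escape: backslash, hex code point, terminating space
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else if (cp == '-' && ident.length() == 1) {
                out.append("\\-"); // a lone hyphen is not a valid identifier
            } else if (letter || digit || cp == '-' || cp == '_' || cp >= 0x80) {
                out.appendCodePoint(cp); // allowed as-is
            } else {
                out.append('\\').appendCodePoint(cp); // e.g. '+' becomes "\+"
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("." + escape("KAHaP+")); // prints .KAHaP\+
        System.out.println(escape("B&W?"));         // prints B\&W\?
    }
}
```

If the selector engine rejects escaped identifiers, an attribute selector with a quoted value, such as [class~="KAHaP+"], may be another option, since quoted strings are not subject to the identifier rules.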

Failed to create web driver pulsar_chrome, caused by "Using unsafe HTTP verb GET to invoke /json/new. This action supports only PUT verb."

15:39:53.165 [r-worker-2] INFO a.p.pulsar.common.ProcessLauncher - Launching process:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --headless --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/pulsar-kust/context/browser/br.2jede
15:39:53.487 [r-worker-2] ERROR a.p.p.p.b.driver.WebDriverFactory - Failed to create web driver pulsar_chrome
ai.platon.pulsar.protocol.browser.DriverLaunchException: Failed to create chrome devtools driver
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsDriver.(ChromeDevtoolsDriver.kt:110)
at ai.platon.pulsar.protocol.browser.driver.WebDriverFactory.createChromeDevtoolsDriver(WebDriverFactory.kt:80)
at ai.platon.pulsar.protocol.browser.driver.WebDriverFactory.create(WebDriverFactory.kt:44)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.createDriverIfNecessary(LoadingWebDriverPool.kt:226)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll0(LoadingWebDriverPool.kt:204)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:118)
at ai.platon.pulsar.protocol.browser.driver.LoadingWebDriverPool.poll(LoadingWebDriverPool.kt:113)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.firstLaunch(WebDriverPoolManager.kt:255)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.access$firstLaunch(WebDriverPoolManager.kt:40)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invokeSuspend(WebDriverPoolManager.kt:211)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager$run0$2.invoke(WebDriverPoolManager.kt)
at ai.platon.pulsar.common.PreemptChannelSupport.whenNormalDeferred(PreemptChannelSupport.kt:59)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run0(WebDriverPoolManager.kt:194)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:105)
at ai.platon.pulsar.protocol.browser.driver.WebDriverPoolManager.run(WebDriverPoolManager.kt:101)
at ai.platon.pulsar.protocol.browser.emulator.context.WebDriverContext.run(BrowserContexts.kt:60)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun$suspendImpl(BrowserPrivacyContext.kt:43)
at ai.platon.pulsar.protocol.browser.emulator.context.BrowserPrivacyContext.doRun(BrowserPrivacyContext.kt)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run$suspendImpl(PrivacyContext.kt:118)
at ai.platon.pulsar.crawl.fetch.privacy.PrivacyContext.run(PrivacyContext.kt)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run0(MultiPrivacyContextManager.kt:118)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run(MultiPrivacyContextManager.kt:101)
at ai.platon.pulsar.protocol.browser.emulator.context.MultiPrivacyContextManager.run(MultiPrivacyContextManager.kt:54)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulatedFetcher.fetchTaskDeferred(BrowserEmulatedFetcher.kt:76)
at ai.platon.pulsar.protocol.browser.emulator.BrowserEmulatedFetcher.fetchContentDeferred(BrowserEmulatedFetcher.kt:69)
at ai.platon.pulsar.protocol.browser.BrowserEmulatorProtocol.getResponseDeferred(BrowserEmulatorProtocol.kt:49)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputDeferred$suspendImpl(AbstractHttpProtocol.kt:101)
at ai.platon.pulsar.crawl.protocol.http.AbstractHttpProtocol.getProtocolOutputDeferred(AbstractHttpProtocol.kt)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContentDeferred0(FetchComponent.kt:133)
at ai.platon.pulsar.crawl.component.FetchComponent.fetchContentDeferred(FetchComponent.kt:95)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentDeferred(LoadComponent.kt:442)
at ai.platon.pulsar.crawl.component.LoadComponent.fetchContentIfNecessaryDeferred(LoadComponent.kt:232)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred1(LoadComponent.kt:217)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred0(LoadComponent.kt:211)
at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetryDeferred(LoadComponent.kt:107)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred(LoadComponent.kt:94)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred$suspendImpl(AbstractPulsarContext.kt:326)
at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred(AbstractPulsarContext.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCacheDeferred(AbstractPulsarSession.kt:207)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:197)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:190)
at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
at ai.platon.pulsar.crawl.StreamingCrawler.loadWithEventHandlers(StreamingCrawler.kt:520)
at ai.platon.pulsar.crawl.StreamingCrawler.loadUrl(StreamingCrawler.kt:416)
at ai.platon.pulsar.crawl.StreamingCrawler.runUrlTask(StreamingCrawler.kt:405)
at ai.platon.pulsar.crawl.StreamingCrawler.access$runUrlTask(StreamingCrawler.kt:68)
at ai.platon.pulsar.crawl.StreamingCrawler$runWithStatusCheck$2.invokeSuspend(StreamingCrawler.kt:379)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Caused by: ai.platon.pulsar.browser.driver.chrome.util.WebSocketServiceException: Received error (405) - Method Not Allowed
Using unsafe HTTP verb GET to invoke /json/new. This action supports only PUT verb.
at ai.platon.pulsar.browser.driver.chrome.impl.Chrome.request(Chrome.kt:157)
at ai.platon.pulsar.browser.driver.chrome.impl.Chrome.createTab(Chrome.kt:66)
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsBrowserInstance.createTab(ChromeDevtoolsBrowserInstance.kt:45)
at ai.platon.pulsar.protocol.browser.driver.cdt.ChromeDevtoolsDriver.(ChromeDevtoolsDriver.kt:97)
... 54 common frames omitted
15:39:53.489 [r-worker-2] WARN a.p.pulsar.crawl.StreamingCrawler - Failed to create web driver | pulsar_chrome

The inactive privacy context was not closed properly.

13:00:14.018 [r-worker-1] INFO a.p.p.p.b.e.c.BrowserPrivacyContext - Privacy context #10102H5yL71 has lived for 2h59m33s | success: 1248(0.12 pages/s) | small: 1(0.1%) | traffic: 376.43 MiB(35.70 KiB/s) | tasks: 1290 total run: 1284 | null

About distribution

It would be nice if the project had at least a Dockerfile, so that it can be distributed the way real-world deployments expect.

That would also help for security purposes of any kind.

At the very least.

Failed to get the total disk space

Exception in thread "DefaultDispatcher-worker-1 @sc#1" java.nio.file.FileSystemException: /run/user/1000/doc: Operation not permitted
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
at java.base/sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:115)
at java.base/sun.nio.fs.UnixFileStore.getTotalSpace(UnixFileStore.java:122)
at ai.platon.pulsar.common.metrics.AppMetrics$Companion.getFreeSpace(AppMetrics.kt:107)

`vincent@vincent-KLVC-WXX9:~/workspace/pulsar-1.10.x$ java -version
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

vincent@vincent-KLVC-WXX9:~$ uname -a
Linux vincent-KLVC-WXX9 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
`

File stores without the correct permissions should be handled properly:

FileSystems.getDefault().fileStores
    .filter { ByteUnitConverter.convert(totalSpaceOr0(it), "G") > 20 }
    .map { unallocatedSpaceOr0(it) }
    .filter { it > 0 }
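
The snippet above does not show the bodies of totalSpaceOr0 and unallocatedSpaceOr0. A minimal Java sketch of what such helpers could look like follows; the helper names are taken from the snippet, while the class name SafeFileStores, the method unallocatedSpaces, and the helper bodies are assumptions for illustration, not the project's actual code. The idea is that FileStore.getTotalSpace() and getUnallocatedSpace() may throw an IOException (FileSystemException is a subclass) for stores such as /run/user/1000/doc, and those stores are treated as having zero space so they get filtered out.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.List;

public final class SafeFileStores {
    private SafeFileStores() {}

    /** Total size in bytes, or 0 if the store cannot be queried. */
    static long totalSpaceOr0(FileStore store) {
        try {
            return store.getTotalSpace();
        } catch (IOException e) {
            return 0L; // e.g. "Operation not permitted"
        }
    }

    /** Unallocated bytes, or 0 if the store cannot be queried. */
    static long unallocatedSpaceOr0(FileStore store) {
        try {
            return store.getUnallocatedSpace();
        } catch (IOException e) {
            return 0L;
        }
    }

    /** Unallocated space of every readable store whose total size exceeds minTotalBytes. */
    public static List<Long> unallocatedSpaces(long minTotalBytes) {
        List<Long> result = new ArrayList<>();
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            if (totalSpaceOr0(store) <= minTotalBytes) {
                continue; // too small, or unreadable and therefore reported as 0
            }
            long free = unallocatedSpaceOr0(store);
            if (free > 0) {
                result.add(free);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long twentyGiB = 20L * 1024 * 1024 * 1024;
        System.out.println(unallocatedSpaces(twentyGiB));
    }
}
```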

Headless chrome has detectable user-agent

With the latest Google Chrome:

When using the google-chrome browser in regular (headed) mode:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36

When using the google-chrome-headless browser:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/95.0.4638.69 Safari/537.36
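
The two strings differ only in the product token, "HeadlessChrome/" versus "Chrome/". A minimal sketch of rewriting the headless user-agent so that it matches the regular one is shown below; the class UserAgents and the method maskHeadless are hypothetical names, and how the resulting string is applied to the browser (for example through Chrome's --user-agent command-line switch) is left out here and not taken from PulsarRPA.

```java
public final class UserAgents {
    private UserAgents() {}

    /** Replaces the headless product token with the regular Chrome token. */
    public static String maskHeadless(String userAgent) {
        return userAgent.replace("HeadlessChrome/", "Chrome/");
    }

    public static void main(String[] args) {
        String headless = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) HeadlessChrome/95.0.4638.69 Safari/537.36";
        System.out.println(maskHeadless(headless));
    }
}
```

A user-agent that already carries the regular token is returned unchanged, so the function can be applied unconditionally.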

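Error when starting the REST service

Environment: two systems, Windows 11 with JDK 11 and Maven 3.6.3, and CentOS 7 with JDK 11 and Maven 3.9.2.
After cloning the code locally with git clone, starting the REST service fails with an error (screenshot not included).

However, on Windows 11 the examples under pulsar-app/pulsar-examples/ run normally in IDEA.
On CentOS 7, running bin/build-run.sh as described in the documentation reports the same error.
Switching to version 1.10.22 gives the same error as well.
What could be the cause?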

Too many warning logs after MongoDB crashes

Too many warning logs after MongoDB crashes:

10:56:03.617 [r-worker-5] WARN a.p.p.c.i.StreamingCrawler - [Unexpected]
ai.platon.shaded.com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$1@7335a5ec. Client view of cluster state is {type=STANDALONE, servers=[{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING, exception={ai.platon.shaded.com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
at ai.platon.shaded.com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:408)
at ai.platon.shaded.com.mongodb.internal.connection.BaseCluster.selectServer(BaseCluster.java:123)
at ai.platon.shaded.com.mongodb.internal.connection.AbstractMultiServerCluster.selectServer(AbstractMultiServerCluster.java:54)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:152)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:102)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:282)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:206)
at ai.platon.shaded.com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:180)
at ai.platon.shaded.com.mongodb.DBCollection.executeWriteOperation(DBCollection.java:356)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:588)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:507)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:482)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:459)
at ai.platon.shaded.com.mongodb.DBCollection.update(DBCollection.java:527)
at org.apache.gora.mongodb.store.MongoStore.performPut(MongoStore.java:380)
at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:345)
at org.apache.gora.mongodb.store.MongoStore.put(MongoStore.java:70)
at ai.platon.pulsar.persist.WebDb.putInternal(WebDb.kt:134)
at ai.platon.pulsar.persist.WebDb.put(WebDb.kt:109)
at ai.platon.pulsar.persist.WebDb.put$default(WebDb.kt:109)
at ai.platon.pulsar.crawl.component.LoadComponent.persist(LoadComponent.kt:575)
at ai.platon.pulsar.crawl.component.LoadComponent.onLoaded(LoadComponent.kt:371)
at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred1(LoadComponent.kt:231)
at ai.platon.pulsar.crawl.component.LoadComponent.access$loadDeferred1(LoadComponent.kt:41)
at ai.platon.pulsar.crawl.component.LoadComponent$loadDeferred1$1.invokeSuspend(LoadComponent.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

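Error when running mvn install

I am using Ubuntu 22.04, Maven 3.9.6, and Java 19.0.2.
I ran mvn install in the root directory of the downloaded project, but it keeps failing partway through, and I have not been able to fix it after a lot of tweaking.
The errors are shown in the screenshots (2023-12-21_17-42, 2023-12-21_17-41, 2023-12-21_17-41_1; not included).

What could be the cause? I was previously on Maven 3.6; after the error I upgraded Maven as suggested in the issues, but it still fails. The Kotlin compiler is installed as well. Could some component be missing?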

Too many RobustRPC logs

Too many RobustRPC logs, for example:

2023-09-12 14:07:43.898 INFO [-worker-37] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
2023-09-12 14:07:44.072 INFO [-worker-51] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
2023-09-12 14:07:45.095 INFO [-worker-23] a.p.p.p.b.d.c.d.RobustRPC - [scrollTo] (3/5) | -32000, DOM Error while querying
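
One possible mitigation, sketched here purely as an illustration and not as the project's actual fix, is to throttle repeated messages by key so that the same [scrollTo] / -32000 line is written at most once per interval. The class LogThrottle and its method shouldLog are hypothetical names; the caller passes the current time explicitly, which keeps the sketch easy to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class LogThrottle {
    private final long intervalMillis;
    private final Map<String, Long> lastLogged = new ConcurrentHashMap<>();

    public LogThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if a message with this key should be logged at time nowMillis. */
    public boolean shouldLog(String key, long nowMillis) {
        boolean[] allowed = {false};
        // compute() is atomic per key, so concurrent workers cannot both pass the check.
        lastLogged.compute(key, (k, last) -> {
            if (last == null || nowMillis - last >= intervalMillis) {
                allowed[0] = true;
                return nowMillis; // remember when this key was last logged
            }
            return last; // still inside the quiet interval
        });
        return allowed[0];
    }

    public static void main(String[] args) {
        LogThrottle throttle = new LogThrottle(60_000);
        String key = "scrollTo|-32000";
        if (throttle.shouldLog(key, System.currentTimeMillis())) {
            System.out.println("[scrollTo] (3/5) | -32000, DOM Error while querying");
        }
    }
}
```

Keying on the method name plus the error code keeps distinct failures visible while collapsing the repeats that come from many workers hitting the same error.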

Function "DOM_LOCATION" not found; SQL statement

I call StaticH2SQLContext().executeQuery(sql) in the IDE and it works fine.
But when I build with mvn package and start the project with java -jar, it throws this error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
Caused by: org.h2.jdbc.JdbcSQLException: Function "DOM_LOCATION" not found; SQL statement:
select xxx from xxx
at org.h2.message.DbException.getJdbcSQLException(DbException.java:357)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.command.Parser.readJavaFunction(Parser.java:2699)
at org.h2.command.Parser.readFunction(Parser.java:2756)
at org.h2.command.Parser.readTerm(Parser.java:3102)
at org.h2.command.Parser.readFactor(Parser.java:2587)
at org.h2.command.Parser.readSum(Parser.java:2574)
at org.h2.command.Parser.readConcat(Parser.java:2544)
at org.h2.command.Parser.readCondition(Parser.java:2370)
at org.h2.command.Parser.readAnd(Parser.java:2342)
at org.h2.command.Parser.readExpression(Parser.java:2334)
at org.h2.command.Parser.parseSelectSimpleSelectPart(Parser.java:2245)
at org.h2.command.Parser.parseSelectSimple(Parser.java:2277)
at org.h2.command.Parser.parseSelectSub(Parser.java:2133)
at org.h2.command.Parser.parseSelectUnion(Parser.java:1946)
at org.h2.command.Parser.parseSelect(Parser.java:1919)
at org.h2.command.Parser.parsePrepared(Parser.java:463)
at org.h2.command.Parser.parse(Parser.java:335)
at org.h2.command.Parser.parse(Parser.java:307)
at org.h2.command.Parser.prepareCommand(Parser.java:278)
at org.h2.engine.Session.prepareLocal(Session.java:626)
at org.h2.engine.Session.prepareCommand(Session.java:564)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1247)
at org.h2.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:78)
at ai.platon.pulsar.ql.context.AbstractSQLContext.executeQuery(AbstractSQLContext.kt:89)
... 10 more

When I package the project using ScentSQLContext.create(), it works correctly, but with SQLContexts.create() it fails with the error above.
How can I use SQLContext in a packaged build and have it work?

Windows packaging configuration: after a lot of trouble, the package finally builds successfully

 <mirror>
    <id>aliyunmaven</id>
    <mirrorOf>central</mirrorOf>
    <name>Aliyun public repository</name>
    <url>https://maven.aliyun.com/repository/public</url>
</mirror>
<mirror>
    <id>spring</id>
    <mirrorOf>central</mirrorOf>
    <name>Spring public repository</name>
  <url>https://maven.aliyun.com/repository/spring</url>
</mirror>
 <mirror>
    <id>repo</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>https://repo.maven.apache.org/maven2/</url>
</mirror>
<mirror>
    <id>repo2</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>https://oss.sonatype.org/#stagingRepositories</url>
</mirror>
<mirror>
    <id>repo3</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>https://repo1.maven.org/maven2/ai/platon/pulsar</url>
</mirror>
<mirror>
    <id>platonic</id>
    <mirrorOf>public</mirrorOf>
    <name>platonic public repository</name>
  <url>http://static.platonic.fun/repo/</url>
</mirror>
<mirror>
    <id>maven-default-http-blocker</id>
    <mirrorOf>dummy</mirrorOf>
    <name>Dummy mirror to override default blocking mirror that blocks http</name>
    <url>http://0.0.0.0/</url>
</mirror>

Failed to load proxy.providers.txt by multiple threads in parallel. The file should be locked.

Loading proxy.providers.txt fails when multiple threads read it in parallel. Access to the file should be guarded by a lock.

17:02:17.125 [-worker-33] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
17:02:17.125 [-worker-30] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
17:02:17.125 [r-worker-8] WARN a.p.p.c.proxy.ProxyLoader - Failed to load - /home/platonai/.pulsar/proxy/providers-enabled/proxy.providers.txt
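
The log above shows three workers hitting the same file in the same millisecond. One way to serialize the load is sketched below. It is only an illustration under stated assumptions: `LockedFileLoader` is a hypothetical class, not Pulsar's `ProxyLoader`, and it uses an in-process `ReentrantLock` rather than an operating-system file lock, since the contention reported here is between threads of one JVM. The file is re-read only when its last-modified time changes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not part of Pulsar): lets only one thread at a time load the file. */
final class LockedFileLoader {
    private final Path file;
    private final ReentrantLock lock = new ReentrantLock();
    private FileTime loadedVersion;    // guarded by lock
    private List<String> cachedLines;  // guarded by lock
    private int diskReads;             // guarded by lock; kept only to observe the behaviour

    LockedFileLoader(Path file) {
        this.file = file;
    }

    /** Returns the file's lines, reading from disk only if the file changed since the last load. */
    List<String> load() throws IOException {
        lock.lock();
        try {
            FileTime modified = Files.getLastModifiedTime(file);
            if (cachedLines == null || !modified.equals(loadedVersion)) {
                cachedLines = Collections.unmodifiableList(
                        Files.readAllLines(file, StandardCharsets.UTF_8));
                loadedVersion = modified;
                diskReads++;
            }
            return cachedLines;
        } finally {
            lock.unlock();
        }
    }

    /** Number of times the file was actually read from disk. */
    int diskReads() {
        lock.lock();
        try {
            return diskReads;
        } finally {
            lock.unlock();
        }
    }
}
```

With this shape, any number of workers can call `load()` concurrently; the first one reads the file and the rest reuse the cached lines. A `java.nio.channels.FileLock` would be the tool for coordinating separate processes, but it is held on behalf of the whole JVM and so does not arbitrate between threads.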

Problem with driver.allTexts()

driver.allTexts() is supposed to return a List, but when I call it the result looks different: the list seems to contain another list nested inside it.

val logisticsInfoList=driver.allTexts(".logistics-info-mod__header___2_fWN")
println("logisticsInfoList="+logisticsInfoList)

The printed result is logisticsInfoList=[["菜鸟直送(丹鸟KD):621089810336681","申通快递:773260085378001"]]
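
The double quotes in that output are a hint. Printing a list of plain strings does not quote its elements, so the outer list probably holds a single string that is itself JSON array text. This is inferred from the printed output only, not from the implementation of `driver.allTexts()`. The sketch below reproduces the shape and shows one possible workaround. `AllTextsShape`, `flatten`, and `unpackJsonStringArray` are hypothetical names, the values are placeholders, and the tiny parser handles only flat arrays of simple strings; a real JSON library would be the better choice in practice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration: one string element holding JSON array text. */
final class AllTextsShape {

    /** Extracts the quoted strings from text such as ["a","b"]; handles \" and \\ only. */
    static List<String> unpackJsonStringArray(String json) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a quoted string
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (current == null) {
                if (c == '"') {
                    current = new StringBuilder();
                }
            } else if (c == '\\' && i + 1 < json.length()) {
                current.append(json.charAt(++i)); // keep the escaped character literally
            } else if (c == '"') {
                out.add(current.toString());
                current = null;
            } else {
                current.append(c);
            }
        }
        return out;
    }

    /** Expands elements that look like JSON arrays; leaves every other element untouched. */
    static List<String> flatten(List<String> raw) {
        List<String> out = new ArrayList<>();
        for (String s : raw) {
            String t = s.trim();
            if (t.startsWith("[") && t.endsWith("]")) {
                out.addAll(unpackJsonStringArray(t));
            } else {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder values standing in for the carrier strings in the report.
        List<String> raw = Collections.singletonList("[\"A:1\",\"B:2\"]");
        System.out.println("logisticsInfoList=" + raw);  // logisticsInfoList=[["A:1","B:2"]]
        System.out.println("flattened=" + flatten(raw)); // flattened=[A:1, B:2]
    }
}
```

If the real return value has this shape, checking `logisticsInfoList.size` would show 1 rather than 2, which is a quick way to confirm the diagnosis.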

User specified chrome path

Pulsar has to know where Chrome is installed in order to drive it, so a user-specified path should be supported.

By default, pulsar searches the following paths for google chrome:

val CHROME_BINARY_SEARCH_PATHS = arrayOf(
"/usr/bin/google-chrome-stable",
"/usr/bin/google-chrome",
"/opt/google/chrome/chrome",
"C:/Program Files (x86)/Google/Chrome/Application/chrome.exe",
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
"/Applications/Chromium.app/Contents/MacOS/Chromium",
"/usr/bin/chromium",
"/usr/bin/chromium-browser"
)
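
A minimal sketch of how a user-specified path could take precedence over that search list. `ChromeLocator` and the `chrome.binary.path` system property are hypothetical names chosen for illustration, not existing Pulsar settings. The design choice shown here is that an explicit path which does not point at an executable yields an empty result instead of silently falling back, so a misconfiguration stays visible.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: resolves the Chrome binary, preferring a user-specified path. */
final class ChromeLocator {

    /** The same candidates as the default search list quoted above. */
    static final List<String> DEFAULT_SEARCH_PATHS = Arrays.asList(
            "/usr/bin/google-chrome-stable",
            "/usr/bin/google-chrome",
            "/opt/google/chrome/chrome",
            "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe",
            "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
            "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
            "/Applications/Chromium.app/Contents/MacOS/Chromium",
            "/usr/bin/chromium",
            "/usr/bin/chromium-browser");

    /**
     * @param userSpecified explicit path chosen by the user; may be null or blank
     * @param searchPaths   fallback candidates, tried in order
     */
    static Optional<Path> locate(String userSpecified, List<String> searchPaths) {
        if (userSpecified != null && !userSpecified.trim().isEmpty()) {
            Path explicit = Paths.get(userSpecified.trim());
            // An explicit but unusable path is reported as "not found", not replaced.
            return Files.isExecutable(explicit) ? Optional.of(explicit) : Optional.empty();
        }
        return searchPaths.stream()
                .map(Paths::get)
                .filter(Files::isExecutable)
                .findFirst();
    }

    public static void main(String[] args) {
        // "chrome.binary.path" is a made-up property name used only for this sketch.
        String userPath = System.getProperty("chrome.binary.path");
        Optional<Path> chrome = locate(userPath, DEFAULT_SEARCH_PATHS);
        System.out.println(chrome.map(Path::toString).orElse("Chrome binary not found"));
    }
}
```

Run it with, for example, `java -Dchrome.binary.path=/path/to/chrome ChromeLocator` to exercise the override.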

Build fails on Ubuntu 18.04

bin/build.sh ->

[INFO] Running ai.platon.pulsar.common.sql.TestSQLTemplate
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 s - in ai.platon.pulsar.common.sql.TestSQLTemplate
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] TestAppRuntimes.testDeleteBrokenSymbolicLinksUsingJava:92->testDeleteBrokenSymbolicLinksUsingJava$lambda-14:92 » FileSystem
[INFO]
[ERROR] Tests run: 149, Failures: 0, Errors: 1, Skipped: 8
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Pulsar 1.8.4-SNAPSHOT:
[INFO]
[INFO] Pulsar ............................................. SUCCESS [ 1.669 s]
[INFO] Pulsar Common ...................................... FAILURE [ 44.509 s]
[INFO] Pulsar Third ....................................... SKIPPED

Connection timed out when closing staging repository

Connection timed out when closing staging repository:

Uploaded to ossrh: https://oss.sonatype.org:443/service/local/staging/deployByRepositoryId/aiplatonpulsar-1066/ai/platon/pulsar/pulsar-protocol/1.10.16/pulsar-protocol-1.10.16.pom.asc (659 B at 805 B/s)
[INFO]  * Upload of locally staged artifacts finished.
[INFO]  * Closing staging repository with ID "aiplatonpulsar-1066".
[ERROR] Remote staging finished with a failure: java.net.SocketException: Connection timed out (Read failed)

NullPointerException in StreamingCrawler.handleCanceled.

Exception in thread "DefaultDispatcher-worker-7" java.lang.NullPointerException
at ai.platon.pulsar.crawl.impl.StreamingCrawler.handleCanceled(StreamingCrawler.kt:649)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.handleRetry(StreamingCrawler.kt:541)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.runLoadTaskWithEventHandlers(StreamingCrawler.kt:462)
at ai.platon.pulsar.crawl.impl.StreamingCrawler.access$runLoadTaskWithEventHandlers(StreamingCrawler.kt:68)
at ai.platon.pulsar.crawl.impl.StreamingCrawler$runLoadTaskWithEventHandlers$1.invokeSuspend(StreamingCrawler.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:33)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

How to keep my login session?

The manual login method is:

  1. [Optional] Delete ~/.pulsar/browser
  2. Run OpenPrototypeChrome.kt
  3. Manually visit the target website and browse several more web pages to create a browsing context.
  4. All subsequent browser tasks will inherit the above browsing context.

Alternatively, copy the browser environment you use daily to the corresponding subdirectory under ~/.pulsar.
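
The copy can also be scripted. The sketch below is an assumption-laden illustration rather than a Pulsar utility: `BrowserProfileCopier` is a hypothetical class, the source directory is whatever Chrome user data directory you pass in, and the target is the `~/.pulsar/browser/chrome/prototype/google-chrome` directory used in this thread. Closing Chrome first is advisable, since a running browser may keep profile files locked. Symbolic links are skipped so that a stale lock link cannot break the copy.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper: copies a Chrome user data directory into the prototype directory. */
final class BrowserProfileCopier {

    /** The prototype directory mentioned in this thread, resolved against the user's home. */
    static Path defaultPrototypeDir() {
        return Paths.get(System.getProperty("user.home"),
                ".pulsar", "browser", "chrome", "prototype", "google-chrome");
    }

    /** Recursively copies source into target, overwriting files that already exist. */
    static void copyTree(Path source, Path target) throws IOException {
        Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                Files.createDirectories(target.resolve(source.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                if (attrs.isSymbolicLink()) {
                    return FileVisitResult.CONTINUE; // skip links such as stale lock files
                }
                Files.copy(file, target.resolve(source.relativize(file)),
                        StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: BrowserProfileCopier <chrome-user-data-dir>");
            return;
        }
        copyTree(Paths.get(args[0]), defaultPrototypeDir());
        System.out.println("Copied to " + defaultPrototypeDir());
    }
}
```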

After copying, the directory ~/.pulsar/browser/chrome/prototype/google-chrome should contain the following files:

PS C:\Users\pereg\.pulsar\browser\chrome\prototype\google-chrome> ls

    Directory: C:\Users\pereg\.pulsar\browser\chrome\prototype\google-chrome

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----           2023/11/5    17:32                AutofillStates
d----          2023/12/14    16:23                BrowserMetrics
d----           2023/11/5    18:04                CertificateRevocation
d----           2023/12/4    22:36                component_crx_cache
d----          2023/10/27     9:43                Crashpad
d----           2023/11/1    16:22                Crowd Deny
d----          2023/12/14    16:25                Default
d----           2023/11/1    16:38                extensions_crx_cache
d----           2023/11/1    13:20                FileTypePolicies
d----           2023/11/5    17:32                FirstPartySetsPreloaded
d----          2023/10/27     9:43                GraphiteDawnCache
d----           2023/11/5    18:04                GrShaderCache
d----           2023/11/5    17:19                hyphen-data
d----           2023/11/1    16:41                Local Traces
d----          2023/10/27     9:43                MediaFoundationWidevineCdm
d----          2023/10/27     9:43                MEIPreload
d----          2023/10/27     9:43                OnDeviceHeadSuggestModel
d----           2023/12/6     9:51                OptimizationGuidePredictionModels
d----           2023/12/6     9:51                OptimizationHints
d----           2023/11/1    13:24                OriginTrials
d----           2023/12/6     9:51                PKIMetadata
d----          2023/10/31    17:34                pnacl
d----           2023/11/1    16:41                PnaclTranslationCache
d----           2023/11/5    17:32                PrivacySandboxAttestationsPreloaded
d----          2023/10/27     9:43                RecoveryImproved
d----           2023/11/1    16:38                Safe Browsing
d----           2023/12/6     9:51                SafetyTips
d----           2023/11/5    16:52                segmentation_platform
d----          2023/10/27     9:43                ShaderCache
d----          2023/10/31    19:48                SSLErrorAssistant
d----          2023/10/27    12:06                Subresource Filter
d----           2023/11/1    13:21                ThirdPartyModuleList64
d----           2023/11/5    16:52                TpcdMetadata
d----          2023/10/31    18:55                TrustTokenKeyCommitments
d----           2023/11/5    18:06                Webstore Downloads
d----          2023/10/27     9:43                WidevineCdm
d----           2023/11/1    16:22                ZxcvbnData
-a---          2023/12/14    16:23             59 DevToolsActivePort
-a---           2023/11/5    18:05         451968 en-US-10-1.bdic
-a---           2023/11/1    16:37              0 First Run
-a---           2023/11/5    18:08          57344 first_party_sets.db
-a---           2023/11/5    18:08              0 first_party_sets.db-journal
-a---           2023/12/4    22:33            106 Last Browser
-a---          2023/12/14    16:23             13 Last Version
-a---          2023/12/14    16:24          77401 Local State
-a---          2023/12/14    16:23             87 Variations

Originally posted by @galaxyeye in #51 (comment)
