apify / actor-legacy-phantomjs-crawler

This actor implements the legacy Apify Crawler product. It uses the PhantomJS headless browser to recursively crawl websites and extract data from them using a piece of JavaScript code.

Home Page: https://apify.com/apify/legacy-phantomjs-crawler

Dockerfile 0.14% HTML 0.82% CSS 0.82% JavaScript 98.23%
phantomjs web-scraping apify web-crawler headless-browsers

actor-legacy-phantomjs-crawler's People

Contributors

davidjohnbarton, dependabot[bot], fnesveda, jancurn, mtrunkat, vbartonicek

actor-legacy-phantomjs-crawler's Issues

Legacy actor no longer generates clean datasets

The clean datasets created by the legacy actor no longer contain only the pageFunction results; they now include all request data.

Steps to reproduce:

  1. Create an actor from the GitHub repository https://github.com/apifytech/actor-legacy-phantomjs-crawler.git#master
  2. Create a task.
  3. Run the task.
  4. View the clean and the full JSON output.

See: https://api.apify.com/v2/datasets/wKr4JiZM3hvrR6fPc/items?format=json&clean=1

[{
  "loadedUrl": "https://www.enjoylifenotfood.com/",
  "requestedAt": "2020-01-07T12:48:26.821Z",
  "loadingStartedAt": "2020-01-07T12:48:27.129Z",
  "loadingFinishedAt": "2020-01-07T12:48:27.374Z",
  "loadErrorCode": null,
  "pageFunctionStartedAt": "2020-01-07T12:48:27.821Z",
  "pageFunctionFinishedAt": "2020-01-07T12:48:27.824Z",
  "type": "StartUrl",
  "isMainFrame": true,
  "willLoad": true,
  "label": "START",
  "referrerId": null,
  "depth": 0,
  "pageFunctionResult": [
    {
      "url": "https://www.enjoylifenotfood.com/",
      "pageTitle": "Enjoy Life Not Food | web-enjoylifenotfood"
    },
    {
      "url": "https://www.enjoylifenotfood.com/",
      "pageTitle": "Enjoy Life Not Food | web-enjoylifenotfood"
    }
  ],
  "downloadedBytes": 14472,
  "queuePosition": "LAST",
  "responseStatus": 200,
  "responseHeaders": {
    "Server": "GitHub.com",
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "Thu, 02 Jan 2020 15:17:01 GMT",
    "ETag": "W/\"5e0e096d-864\"",
    "Access-Control-Allow-Origin": "*",
    "Expires": "Tue, 07 Jan 2020 12:48:18 GMT",
    "Cache-Control": "max-age=600",
    "Content-Encoding": "gzip",
    "X-Proxy-Cache": "MISS",
    "X-GitHub-Request-Id": "D5D4:42C7:10A734E:16AAE9F:5E147BB9",
    "Accept-Ranges": "bytes",
    "Date": "Tue, 07 Jan 2020 12:48:27 GMT",
    "Via": "1.1 varnish",
    "Age": "531",
    "Connection": "keep-alive",
    "X-Served-By": "cache-bwi5064-BWI",
    "X-Cache": "HIT",
    "X-Cache-Hits": "1",
    "X-Timer": "S1578401307.218639,VS0,VE0",
    "Vary": "Accept-Encoding",
    "X-Fastly-Request-ID": "baf21d340f97644219b3e111529d4bc453a2ff78"
  },
  "id": "6uvRRK9EWFuVZVF",
  "url": "https://www.enjoylifenotfood.com/",
  "uniqueKey": "https://www.enjoylifenotfood.com",
  "method": "GET",
  "postData": null,
  "_retryCount": 0,
  "proxy": null
},
{
  "loadedUrl": "https://www.nichd.nih.gov/",
  "requestedAt": "2020-01-07T12:48:26.823Z",
  "loadingStartedAt": "2020-01-07T12:48:28.573Z",
  "loadingFinishedAt": "2020-01-07T12:48:29.808Z",
  "loadErrorCode": null,
  "pageFunctionStartedAt": "2020-01-07T12:48:29.826Z",
  "pageFunctionFinishedAt": "2020-01-07T12:48:29.903Z",
  "type": "StartUrl",
  "isMainFrame": true,
  "willLoad": true,
  "label": null,
  "referrerId": null,
  "depth": 0,
  "pageFunctionResult": [
    {
      "url": "https://www.nichd.nih.gov",
      "pageTitle": "Homepage | NICHD - Eunice Kennedy Shriver National Institute of Child Health and Human Development"
    },
    {
      "url": "https://www.nichd.nih.gov",
      "pageTitle": "Homepage | NICHD - Eunice Kennedy Shriver National Institute of Child Health and Human Development"
    }
  ],
  "downloadedBytes": 2913768,
  "queuePosition": "LAST",
  "responseStatus": 200,
  "responseHeaders": {
    "Date": "Tue, 07 Jan 2020 06:06:08 GMT",
    "Cache-Control": "max-age=31536000, public",
    "X-Drupal-Dynamic-Cache": "MISS",
    "Link": "<https://www.nichd.nih.gov/>; rel=\"shortlink\", <https://www.nichd.nih.gov/>; rel=\"canonical\"",
    "X-UA-Compatible": "IE=edge",
    "Content-language": "en",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "SAMEORIGIN",
    "Expires": "Sun, 19 Nov 1978 05:00:00 GMT",
    "Last-Modified": "Tue, 07 Jan 2020 06:05:24 GMT",
    "ETag": "\"1578377124-gzip\"",
    "Vary": "Cookie,Accept-Encoding",
    "X-Drupal-Cache": "HIT",
    "Content-Encoding": "gzip",
    "Content-Type": "text/html; charset=UTF-8",
    "X-Varnish": "15315436 16257349",
    "Age": "24140",
    "Via": "1.1 varnish-v4, 1.1 dca1-bit27",
    "X-Varnish-Cache": "HIT",
    "Accept-Ranges": "bytes",
    "Connection": "keep-alive",
    "Strict-Transport-Security": "max-age=31536000; preload",
    "X-XSS-Protection": "1; mode=block",
    "Set-Cookie": "TS01b2e53e=010193553fcc618575fdda5d265d60ae6a4a9b74cabe9ff75c6d5cb0060f10648213b817ef0cc130e37d4574ced25235908fc5b1a0; Path=/; Secure; HTTPOnly"
  },
  "id": "S3UIH4WWMnlqgIs",
  "url": "https://www.nichd.nih.gov",
  "uniqueKey": "https://www.nichd.nih.gov",
  "method": "GET",
  "postData": null,
  "_retryCount": 0,
  "proxy": null
}]
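
For comparison, the shape the reporter seems to expect can be sketched as follows. This is a minimal illustration, assuming "clean" means that only the `pageFunctionResult` values are kept and that array results are flattened into individual items. The helper name `toCleanItems` is hypothetical and is not part of the actor or the Apify API, and the sample records are trimmed from the listing above.

```javascript
// Hypothetical helper (not part of the actor): keeps only pageFunctionResult
// values and drops request metadata such as responseHeaders or loadedUrl.
function toCleanItems(fullItems) {
  const clean = [];
  for (const item of fullItems) {
    const result = item.pageFunctionResult;
    // Skip records whose page function returned nothing.
    if (result === null || result === undefined) continue;
    // Assumption: array results are flattened into separate items.
    if (Array.isArray(result)) clean.push(...result);
    else clean.push(result);
  }
  return clean;
}

// Two records trimmed from the dataset output shown in the issue.
const fullItems = [
  {
    loadedUrl: 'https://www.enjoylifenotfood.com/',
    responseStatus: 200,
    pageFunctionResult: [
      { url: 'https://www.enjoylifenotfood.com/', pageTitle: 'Enjoy Life Not Food | web-enjoylifenotfood' },
    ],
  },
  {
    loadedUrl: 'https://www.nichd.nih.gov/',
    responseStatus: 200,
    pageFunctionResult: [
      { url: 'https://www.nichd.nih.gov', pageTitle: 'Homepage | NICHD - Eunice Kennedy Shriver National Institute of Child Health and Human Development' },
    ],
  },
];

const cleanItems = toCleanItems(fullItems);
console.log(JSON.stringify(cleanItems, null, 2));
```

Each resulting item carries only `url` and `pageTitle`, with none of the request fields (`responseHeaders`, `queuePosition`, and so on) that appear in the output above.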

PhantomCrawler: Unhandled exception - Cannot mark request as handled, because it is not in progress!

The crawler crashes and causes the task to fail. The problem appears to be related to a network timeout and the PageManager.

2020-03-22T10:27:24.364Z [S0000005] ERROR: RemoteRequestManager.webPage.onResourceTimeout(): {
2020-03-22T10:27:24.366Z              "errorCode": 408,
2020-03-22T10:27:24.369Z              "errorString": "Network timeout on resource.",
2020-03-22T10:27:24.371Z              "headers": [
2020-03-22T10:27:24.374Z                {
2020-03-22T10:27:24.376Z                  "name": "Accept",
2020-03-22T10:27:24.378Z                  "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
2020-03-22T10:27:24.380Z                },
2020-03-22T10:27:24.383Z                {
2020-03-22T10:27:24.386Z                  "name": "Origin",
2020-03-22T10:27:24.388Z                  "value": "null"
2020-03-22T10:27:24.390Z                },
2020-03-22T10:27:24.394Z                {
2020-03-22T10:27:24.396Z                  "name": "User-Agent",
2020-03-22T10:27:24.399Z                  "value": "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1s-apifier Safari/538.1"
2020-03-22T10:27:24.401Z                },
2020-03-22T10:27:24.403Z                {
2020-03-22T10:27:24.406Z                  "name": "Content-Type",
2020-03-22T10:27:24.408Z                  "value": "application/json"
2020-03-22T10:27:24.410Z                },
2020-03-22T10:27:24.413Z                {
2020-03-22T10:27:24.415Z                  "name": "Content-Length",
2020-03-22T10:27:24.417Z                  "value": "54"
2020-03-22T10:27:24.419Z                }
2020-03-22T10:27:24.421Z              ],
2020-03-22T10:27:24.424Z              "id": 19,
2020-03-22T10:27:24.426Z              "method": "POST",
2020-03-22T10:27:24.428Z              "postData": "{\"messageType\":\"dummy\",\"piggybackBufferedRequests\":[]}",
2020-03-22T10:27:24.431Z              "time": "2020-03-22T10:26:34.395Z",
2020-03-22T10:27:24.433Z              "url": "http://localhost:34519/slave/5"
2020-03-22T10:27:24.437Z            }
2020-03-22T10:27:24.439Z [S0000005] ERROR: RemoteRequestManager.webPage.onResourceError(): {
2020-03-22T10:27:24.441Z              "errorCode": 5,
2020-03-22T10:27:24.443Z              "errorString": "Operation canceled",
2020-03-22T10:27:24.445Z              "id": 19,
2020-03-22T10:27:24.448Z              "status": null,
2020-03-22T10:27:24.450Z              "statusText": null,
2020-03-22T10:27:24.452Z              "url": "http://localhost:34519/slave/5"
2020-03-22T10:27:24.454Z            }
2020-03-22T10:27:24.457Z [S0000005] ERROR: Couldn't send message to control server at 'http://localhost:34519/slave/5' (messageType=dummy, status: 'fail'):
2020-03-22T10:27:24.460Z [S0000005] A fatal error occurred, shutting down...
2020-03-22T10:27:24.463Z INFO: Slave exited {"slaveId":5,"pid":36,"code":232,"signal":null}
2020-03-22T10:27:24.465Z INFO: Reclaiming request to queue, it will be retried again {"requestId":"bzqn4DDibYTEBp2","slaveId":5}
2020-03-22T10:27:25.047Z [S0000018] Loading crawler configuration from: /tmp/tmp-6M1q6QBgl1of3/config.json
2020-03-22T10:27:25.049Z [S0000018] WARNING: No 'crawlPurls' specified in the configuration!
2020-03-22T10:27:25.051Z [S0000018] Starting crawler using RemoteRequestManager (URL: http://localhost:34519/slave/18, bootstrap: undefined)...
2020-03-22T10:27:43.100Z INFO: Reclaiming request to queue, it will be retried again {"requestId":"ywWZUDAYrQ3e7qq"}
2020-03-22T10:27:43.189Z ERROR: PhantomCrawler: Unhandled exception
2020-03-22T10:27:43.192Z   Error: Cannot mark request Fy4QLsrdYTqEB1R as handled, because it is not in progress!
2020-03-22T10:27:43.194Z     at RequestQueue.markRequestHandled (/home/myuser/node_modules/apify/build/request_queue.js:431:13)
2020-03-22T10:27:43.196Z     at PageManager.markRequestHandled (/home/myuser/src/page_manager.js:290:52)
2020-03-22T10:27:43.200Z     at runMicrotasks (<anonymous>)
2020-03-22T10:27:43.202Z     at processTicksAndRejections (internal/process/task_queues.js:97:5)
2020-03-22T10:27:43.204Z     at async PhantomCrawler._handleNextTaskFromSlave (/home/myuser/src/phantom_crawler.js:735:21)
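
One sequence that would produce this kind of exception is a request being reclaimed to the queue and then marked as handled afterwards. The log does not confirm that this is what happened here, so the following is only a sketch of that scenario. `TinyQueue` and `markHandledIfInProgress` are hypothetical names written for this illustration; they are not the Apify SDK's `RequestQueue` or the actor's `PageManager`.

```javascript
// Hypothetical in-memory queue that tracks which requests are in progress.
class TinyQueue {
  constructor() {
    this.pending = [];
    this.inProgress = new Set();
    this.handled = new Set();
  }

  add(id) {
    this.pending.push(id);
  }

  // Moves the next pending request into the in-progress set.
  fetchNext() {
    const id = this.pending.shift();
    if (id === undefined) return null;
    this.inProgress.add(id);
    return id;
  }

  // Returns an in-progress request to the pending list so it is retried.
  reclaim(id) {
    if (!this.inProgress.delete(id)) {
      throw new Error(`Cannot reclaim request ${id}, because it is not in progress!`);
    }
    this.pending.push(id);
  }

  // Throws when the request is no longer in progress, as in the stack trace.
  markHandled(id) {
    if (!this.inProgress.has(id)) {
      throw new Error(`Cannot mark request ${id} as handled, because it is not in progress!`);
    }
    this.inProgress.delete(id);
    this.handled.add(id);
  }
}

// Guarded variant: ignores results for requests that were already reclaimed.
function markHandledIfInProgress(queue, id) {
  if (!queue.inProgress.has(id)) return false;
  queue.markHandled(id);
  return true;
}

const queue = new TinyQueue();
queue.add('req-1');
const requestId = queue.fetchNext();

// The worker is assumed to have died, so the request is reclaimed for a retry.
queue.reclaim(requestId);

// A late result for the same request then fails the unguarded call.
let thrownMessage = null;
try {
  queue.markHandled(requestId);
} catch (err) {
  thrownMessage = err.message;
}
console.log(thrownMessage);

// The guarded call reports that there was nothing to mark instead of throwing.
const wasMarked = markHandledIfInProgress(queue, requestId);
console.log(wasMarked);
```

Whether a guard like this is the right fix depends on why the request left the in-progress state; silently ignoring the late result may hide a duplicate-processing bug rather than resolve it.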

Set OUTPUT from actor

We discussed that, similarly to the input schema, actors could define an output schema that would present a nice UI for the actor's results. Setting OUTPUT could be a first step toward that. The open question is what it should contain: probably links to the full and simplified results, and perhaps the first few records from the output, to enable synchronous calling.
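
To make the proposal concrete, here is a rough sketch of what such an OUTPUT record might look like. Everything in it is an assumption for discussion: the function name `buildOutput`, the field names, the preview size, and the use of the `simplified=1` query parameter for the simplified results link. Only the base URL pattern `https://api.apify.com/v2/datasets/<id>/items?format=json` is taken from the issue above.

```javascript
// Hypothetical builder for the OUTPUT record; field names are illustrative.
function buildOutput(datasetId, items, previewSize = 3) {
  const base = `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items?format=json`;
  return {
    // Link to all stored records, including request metadata.
    fullResultsUrl: base,
    // Assumption: simplified results are requested with simplified=1.
    simplifiedResultsUrl: `${base}&simplified=1`,
    // First few records, so a synchronous caller gets data immediately.
    preview: items.slice(0, previewSize),
  };
}

// Placeholder dataset ID and records, for illustration only.
const sampleItems = [
  { url: 'https://example.com/a' },
  { url: 'https://example.com/b' },
  { url: 'https://example.com/c' },
  { url: 'https://example.com/d' },
];
const output = buildOutput('DATASET_ID', sampleItems);
console.log(JSON.stringify(output, null, 2));
```

In the actor itself, an object like this could presumably be stored under the OUTPUT key of the default key-value store, for example with the SDK's `Apify.setValue('OUTPUT', output)`.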
