jawa's Introduction

Jawa - Visual Scraper

🇭🇷 Started in Croatia

A visual scraper interface that exports to a Puppeteer script you can run anywhere. You can try it out at https://jawa.sh

Jawa lets you visually click elements on any website and then export the selectors as a config that you can run in any Node.js environment to scrape the content when needed.

This repo consists of:

  • web app
  • cli
  • browser extension

Web app

A web app that provides an embedded browser for visually selecting elements and creating a scraper config that you can download and run through the CLI or in the cloud.

Cloud scraping (Beta)

You can now run your scraper config in the cloud directly from the web app. Cloud scrapers use the same Jawa CLI. Cloud scrapers currently have limited availability.

If you need more usage, you can check out Jawa Pro.

CLI

A simple CLI to run configs created and exported from the web app. You can run it like this:

npx jawa path/to/scraper/config/file.json

or run npx jawa --help to see all the options.

The jawa package also exports a scrape function, so it can be used outside of the CLI in your apps or services:

// ES modules
import { scrape } from 'jawa'
// or CommonJS
const { scrape } = require('jawa')

Browser extension

A browser extension that runs the embedded browser powering the visual scraper interface.

It is available on:

  • Chrome Web Store
  • Chrome extensions also work on all Chromium-based browsers, such as:
    • Opera
    • Microsoft Edge
    • Brave

jawa's People

Contributors

capjavert

jawa's Issues

Add browse mode

Users should be able to toggle a "browse mode" in which, instead of picking a selector on the right, each click acts as if you were browsing the page in a normal browser.

This would make it easier to navigate and scrape multiple pages.

Jawa should automatically remember and update the current URL so that the correct scraping metadata is collected when switching into and out of browse mode.
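
One way to picture the requested behaviour is a small session object that tracks the mode and the current URL. This is only a sketch of the idea in the issue, not Jawa's implementation: createSession, handleClick, and the shape of the click object (selector, href) are hypothetical names chosen for illustration.

```javascript
// Hypothetical session model illustrating the "browse mode" request.
// None of these names come from the Jawa codebase.
function createSession(startUrl) {
  const state = { mode: "select", currentUrl: startUrl, selections: [] };

  return {
    // Flip between picking selectors and browsing normally.
    toggleBrowseMode() {
      state.mode = state.mode === "browse" ? "select" : "browse";
      return state.mode;
    },

    // Called for every click inside the embedded page.
    handleClick({ selector, href }) {
      if (state.mode === "browse") {
        // In browse mode a click on a link navigates; remember the new URL.
        if (href) {
          state.currentUrl = new URL(href, state.currentUrl).href;
        }
        return;
      }
      // In select mode, record the selector together with the page it was picked on.
      state.selections.push({ url: state.currentUrl, selector });
    },

    // Return a copy of the state so callers cannot mutate it by accident.
    snapshot() {
      return {
        mode: state.mode,
        currentUrl: state.currentUrl,
        selections: state.selections.map((s) => ({ ...s })),
      };
    },
  };
}
```

Storing the URL next to each selector is what keeps the scraping metadata correct after the user navigates away and switches back to picking elements.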

Add select all elements toggle

The UI should expose a feature where Jawa automatically removes :nth-child and similar positional parts of a selector to widen the set of elements that are scraped.

Example:
A user is scraping the homepage of a blog or an aggregation website. There are multiple article/post items, and the user wants to scrape all the titles on the page. Currently, if they select a title in Jawa, it will scrape only that one title. With this new feature they would be able to check "apply all" or a similar UI element, which would make a best guess at widening the selector to scrape all titles.

The caveat is that this can be quite page-specific, so it will not be a magic bullet, but it should let less technical users scrape content faster.
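
The widening step described above could be approximated by stripping positional pseudo-classes from the selector string. The helper below is a minimal sketch under that assumption. widenSelector is a hypothetical name, and the issue does not say how Jawa would actually implement the feature.

```javascript
// Hypothetical helper, not part of Jawa: removes positional pseudo-classes
// so a selector that matched one item can match all of its siblings.
const POSITIONAL_PSEUDO = new RegExp(
  // Functional forms such as :nth-child(3) or :nth-of-type(2n+1)
  ":(?:nth-child|nth-last-child|nth-of-type|nth-last-of-type)\\([^()]*\\)" +
    // Keyword forms such as :first-child or :last-of-type
    "|:(?:first-child|last-child|first-of-type|last-of-type|only-child|only-of-type)\\b",
  "g"
);

function widenSelector(selector) {
  return selector.replace(POSITIONAL_PSEUDO, "");
}

// A selector picked for the third post title...
const picked = "ul.posts > li:nth-child(3) > h2.title";
// ...is widened so that it matches every post title.
const widened = widenSelector(picked); // "ul.posts > li > h2.title"
```

This is a best guess, in line with the caveat above. It does not handle nested parentheses such as :nth-child(2 of :not(.x)), and a selector part that consists only of a positional pseudo-class would be left with a dangling combinator.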
