semantic_scraper's Introduction

Semantic Collector

Semantic Collector (SemCollect) is a tool to collect data from websites, for use cases like Retrieval Augmented Generation.

Basic usage

This hasn't been released on the web store yet, so for now this browser extension has to be loaded unpacked, in dev mode.

Clone this repo, then load it as an unpacked extension:

  1. Go to chrome://extensions
  2. Enable Developer mode
  3. Click Load unpacked and select the cloned extension directory
  4. Open the service worker logs by clicking the Inspect view link (important for validating what's going on)
  5. Save some content, then click Download. Click Clear when you're done to reset the internal state.

Collect & Crawl

Click Clear Data, then click Collect & Crawl, and sit back and watch. When the crawl stops, click Download to save the results.

You must keep the popup in focus while the crawl is running. You cannot use your browser for anything else until it finishes.

The collector is hardcoded to avoid links that cross over to new hostnames. However, a redirect can still land the crawl on a new host, so it is strongly recommended to set the Link Inclusion Pattern regex, e.g. https://chamomile\.ai/.*.

For extra protection, add specific URLs to exclude, e.g. those that are known to redirect.
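
The three checks described above (same hostname, inclusion pattern, exclusion list) can be sketched as a single predicate. This is a minimal illustration, not SemCollect's actual code: the names `shouldFollow` and `LinkFilterOptions` are hypothetical, and it assumes the inclusion pattern must match the whole URL, which may differ from how the extension applies it.

```typescript
// Hypothetical sketch; names and option shape are assumptions, not SemCollect's code.
interface LinkFilterOptions {
  startUrl: string;        // page where the crawl began
  includePattern?: string; // regex source, e.g. "https://chamomile\\.ai/.*"
  excludeUrls?: string[];  // exact URLs to skip, e.g. known redirects
}

// Parse a URL (optionally relative to a base), returning null if it is invalid.
function parseUrl(raw: string, base?: string): URL | null {
  try {
    return new URL(raw, base);
  } catch {
    return null;
  }
}

function shouldFollow(link: string, opts: LinkFilterOptions): boolean {
  const start = parseUrl(opts.startUrl);
  const target = parseUrl(link, opts.startUrl); // resolves relative links
  if (start === null || target === null) return false;

  // 1. Never cross over to a new hostname.
  if (target.hostname !== start.hostname) return false;

  // 2. Skip URLs that were explicitly excluded.
  const excluded = opts.excludeUrls || [];
  if (excluded.some((u) => u === target.href)) return false;

  // 3. If an inclusion pattern is set, the whole URL must match it (assumption).
  if (opts.includePattern) {
    const re = new RegExp(`^(?:${opts.includePattern})$`);
    if (!re.test(target.href)) return false;
  }
  return true;
}
```

Note that a check like this only sees the link before it is requested. A redirect is only visible after the request, which is why the final URL of each page would need to pass the same test.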

Limitations

This codebase is very new and has some limitations. Some limitations will be easy to solve, and some will be harder.

For limitations that require code changes, we're more than happy to accept pull requests for anything in the easy section. For anything in the hard section, please get in touch first by filing an issue to discuss the approach.

Easy to solve

  • No crawl progress indicator (a practical approach would be to show the number of links crawled and still to crawl in the popup UI)
  • Can't save link inclusion/exclusion specifications to repeat the same jobs
  • Not tested at all on Edge (only tested on Chrome)
  • Very limited testing on complex sites
  • No metadata in the output file; we may, for instance, want to include collection date
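
Two of the items above, the progress indicator and output metadata, are small enough to sketch. The following is only an illustration under stated assumptions: the README does not describe the output file format, so the `pages` array, the `metadata` wrapper, and every name here (`formatProgress`, `withMetadata`, `CrawlProgress`, `CollectionOutput`) are hypothetical.

```typescript
// Hypothetical sketch; none of these names come from the SemCollect codebase.
interface CrawlProgress {
  crawled: number; // links already visited
  queued: number;  // links still waiting to be visited
}

// Build the text a popup could display while a crawl is running.
function formatProgress(p: CrawlProgress): string {
  const total = p.crawled + p.queued;
  return `${p.crawled} crawled / ${p.queued} to crawl (${total} found)`;
}

interface CollectionOutput<T> {
  metadata: { collectedAt: string; pageCount: number };
  pages: T[];
}

// Wrap collected pages with metadata such as the collection date (ISO 8601, UTC).
function withMetadata<T>(pages: T[], now: Date = new Date()): CollectionOutput<T> {
  return {
    metadata: { collectedAt: now.toISOString(), pageCount: pages.length },
    pages,
  };
}
```

Passing the date as a parameter keeps `withMetadata` deterministic in tests, while the default covers normal use.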

Harder to solve

  • No scroll-to-complete-load functionality

semantic_scraper's People

Contributors

marklar-co
