
yo's Introduction

Yo!

Micro Web Crawler in PHP & Manticore

Yo! is a super-thin client-server crawler built on Manticore full-text search.
It works across different networks and includes flexible settings, history snapshots, CLI tools, and an adaptive JS-less UI.

An alternative branch is available for the Gemini Protocol!

Features

  • MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
  • Page snapshot history with support for local and remote mirrors (including the FTP protocol)
  • CLI tools for index administration and crontab tasks
  • JS-less frontend for running a local or public search web portal

Install

Environment

Debian
  • wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
  • dpkg -i manticore-repo.noarch.deb
  • apt update
  • apt install git composer manticore manticore-extra php-fpm php-curl php-mbstring php-gd

The Yo search engine uses Manticore as its primary database. If your server is sensitive to power loss, change the default binlog flush strategy to binlog_flush = 1
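
A minimal sketch of that setting in the searchd section of the Manticore config (the path /etc/manticoresearch/manticore.conf is typical on Debian; adjust to your setup):

```ini
searchd
{
    # flush and sync the binlog on every transaction;
    # safest against sudden power loss, at some write-performance cost
    binlog_flush = 1
}
```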

Deployment

The project is under development; to create a new search project, use the dev-main branch:

  • composer create-project yggverse/yo:dev-main

Development

  • git clone https://github.com/YGGverse/Yo.git
  • cd Yo
  • composer update
  • git checkout -b pr-branch
  • git commit -m 'new fix'
  • git push

Update

  • cd Yo
  • git pull
  • composer update

Init

  • cp example/config.json config.json
  • php src/cli/index/init.php

Usage

  • php src/cli/document/add.php URL
  • php src/cli/document/crawl.php
  • php src/cli/document/search.php '*'

Web UI

  1. cd src/webui
  2. php -S 127.0.0.1:8080
  3. open http://127.0.0.1:8080 in a browser

Documentation

CLI

Index

Init

Create the initial index

php src/cli/index/init.php [reset]
  • reset - optional, resets the existing index

Alter

Change the existing index

php src/cli/index/alter.php {operation} {column} {type}
  • operation - operation name; supported values: add|drop
  • column - target column name
  • type - target column type; supported values: text|integer
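
For example, to add and later drop an integer column (the column name `rank` is only an illustration, not part of the default schema):

```shell
php src/cli/index/alter.php add rank integer
php src/cli/index/alter.php drop rank integer
```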

Document

Add

php src/cli/document/add.php URL
  • URL - the new URL to add to the crawl queue

Crawl

php src/cli/document/crawl.php
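
Since crawling is a recurring task, it fits crontab (as the Features list mentions); a sketch assuming the project lives in /var/www/Yo:

```crontab
# crawl the queue every 10 minutes
*/10 * * * * cd /var/www/Yo && php src/cli/document/crawl.php
```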
Clean

Optimize the index and apply new configuration rules

php src/cli/document/clean.php [limit]
  • limit - optional integer; number of documents processed per queue run
Search

php src/cli/document/search.php '@title "*"' [limit]
  • query - required search query
  • limit - optional limit on the number of search results
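
An illustrative query (Manticore full-text syntax; the field, keyword, and limit here are only examples):

```shell
# search the title field for "manticore", return at most 10 results
php src/cli/document/search.php '@title "manticore"' 10
```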
Migration

YGGo

Import index from YGGo database

php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]

Source DB connection arguments (required), followed by optional flags:

  • host
  • port
  • user
  • password
  • database
  • unique - optional, check for unique URLs (takes more time)
  • start - optional, queue offset to start from
  • limit - optional, queue size limit

Backup

Logical

SQL text dumps can be useful for public index distribution, but require more computing resources.


Physical

Better suited for infrastructure administration; includes the original data binaries.


Instances

  • http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/ - IPv6 0200::/7 addresses only | index
    • http://yo.ygg
    • http://yo.ygg.at
    • http://ygg.yo.index


yo's Issues

Add index dump tool

An automated tool for distributed index dumps of the following entities:

  • DB
  • Snaps

Probably with torrent generation through the yggtracker API.

Useful for nodes with open indexes available to download.

Check DNS resolver connectivity

Currently, the network connection test is implemented by IP only (to process new status codes in the crawl queue).

For example, a Yggdrasil instance uses the external Alfis DNS resolver, which may be offline even when the crawler is connected to the network.
In this case, all pages in the queue may receive 404 codes.

Limits per host setting

It would be nice to have an option to limit the number of pages crawled per host.

Something like this in the crawl config section:

'host' => pages_quantity
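
As a hedged sketch, such a setting might look like this in config.json (the key names are hypothetical; no such option exists yet):

```json
{
  "crawl": {
    "hostPageLimit": {
      "example.com": 1000
    }
  }
}
```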

Crawler optimization

Replace the search query by crc32url ID with a direct ID lookup to prevent full-text search queries in the crawler.

See getDocumentById in manticoresoftware/manticoresearch-php/src/Manticoresearch/Index.php

Fix duplicates with the cleaner tool

A cleanup action is required for previously collected indexes:

http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/search.php?q=http%3A%2F%2F%5B320%3A9c1%3Ae1fa%3Aa105%3A%3Ad%5D%2Fyggdrasil%3Abittorrent%3Aqbittorrent%3Frev%3D1699915157%26do%3Ddiff
