Comments (6)
A customer ordered indexing of a full website by sitemap.xml.
from open-semantic-search.
Indexing a full website by sitemap.xml is now implemented in the command-line ETL tool opensemanticsearch-index-sitemap.
For easier configuration and management of XML sitemaps, I extended the "datasources" web UI for websites with the new tab "Sitemap".
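For readers unfamiliar with the format, here is a minimal sketch of the general idea behind sitemap-based indexing: read the `<loc>` entries of a sitemap.xml and hand each URL to the indexer. This is not the code of opensemanticsearch-index-sitemap; the function name `extract_sitemap_urls` and the example URLs are assumptions for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return the page URLs listed in a sitemap <urlset> document.

    Hypothetical helper for illustration; sitemap index files
    (<sitemapindex>) are not handled here.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sitemap content used only to demonstrate the helper.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc></url>
  <url><loc>https://www.example.org/about.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(EXAMPLE_SITEMAP):
        print(url)
```

Each returned URL would then be passed to whatever indexes a single web page.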
Since Scrapy is open source, very powerful, and already implements management/prevention of loops and many bandwidth management options, I am considering integrating Scrapy and using it as the default web crawler, which could add an etl_web task for each downloaded file.
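To illustrate what loop prevention means for a crawler, here is a minimal sketch that visits every page once even when pages link back to each other. It is not Scrapy's implementation (Scrapy ships its own duplicate request filter); the function `crawl_order` and the in-memory link graph are assumptions made so the example runs without network access.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urldefrag


def crawl_order(start_url: str, links: Dict[str, List[str]]) -> List[str]:
    """Breadth-first traversal that never visits the same URL twice.

    `links` maps a page URL to the URLs it links to (a stand-in for
    fetching and parsing pages). Fragments (#...) are stripped so that
    /a and /a#top count as the same page.
    """
    first = urldefrag(start_url).url
    seen = {first}
    queue = deque([first])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            target = urldefrag(target).url
            if target not in seen:  # skip already queued or visited pages
                seen.add(target)
                queue.append(target)
    return order


# Made-up link graph containing a cycle: a -> b -> a.
EXAMPLE_LINKS = {
    "https://www.example.org/a": [
        "https://www.example.org/b",
        "https://www.example.org/a#top",
    ],
    "https://www.example.org/b": ["https://www.example.org/a"],
}

if __name__ == "__main__":
    print(crawl_order("https://www.example.org/a", EXAMPLE_LINKS))
```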
That'd be great, using Scrapy. I don't remember if it can follow JavaScript or not; perhaps via Splash. Many pages are generated using JS. Another good open source web crawler is HTTrack.
A first beta is available as etl_web_crawl.py, which integrates Scrapy for crawling full websites.
Todo: Integration with the datasources web UI.
Implemented the datasources web UI for crawling full websites (paths or domain) with Scrapy.
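The difference between crawling a path and crawling a full domain can be sketched as a simple scope check on each discovered link. This is only an illustration under assumed semantics, not the code behind the datasources UI or etl_web_crawl.py; the function `in_crawl_scope` and its `full_domain` parameter are hypothetical names.

```python
from urllib.parse import urlsplit


def in_crawl_scope(url: str, start_url: str, full_domain: bool) -> bool:
    """Decide whether a discovered link should be followed.

    Assumed semantics: with full_domain=True every page on the start
    URL's host is in scope; otherwise only pages below the directory
    of the start URL are in scope.
    """
    target = urlsplit(url)
    start = urlsplit(start_url)
    if target.scheme not in ("http", "https"):
        return False
    if target.hostname != start.hostname:
        return False
    if full_domain:
        return True
    # Path mode: derive the directory of the start URL, e.g.
    # /docs/index.html -> /docs/ and an empty path -> /
    if start.path.endswith("/"):
        base_path = start.path
    else:
        base_path = start.path.rsplit("/", 1)[0] + "/"
    return (target.path or "/").startswith(base_path)


if __name__ == "__main__":
    start = "https://www.example.org/docs/index.html"
    print(in_crawl_scope("https://www.example.org/docs/a.html", start, False))
    print(in_crawl_scope("https://www.example.org/blog/post", start, False))
    print(in_crawl_scope("https://www.example.org/blog/post", start, True))
```

In Scrapy itself, restricting a spider to a domain is usually done with the spider's `allowed_domains` attribute; the path restriction above is the part that needs extra link filtering.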