Sitemap Generator

This script generates an XML sitemap for a website. It crawls the site recursively, following links up to a configurable depth, and writes the URLs of the pages it finds to the sitemap.
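The depth-limited crawl described above can be sketched roughly as follows. This is an illustration, not the script's actual code: `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names, the "website" is an in-memory dictionary so the example runs without network access, and the traversal goes level by level (breadth-first) rather than by literal recursion.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "website": page URL -> links found on that page.
FAKE_SITE = {
    "https://www.example.com/": ["/a", "/b", "https://other.example.org/x"],
    "https://www.example.com/a": ["/c"],
    "https://www.example.com/b": ["/a"],
    "https://www.example.com/c": ["/d"],
    "https://www.example.com/d": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return [urljoin(url, href) for href in FAKE_SITE.get(url, [])]


def crawl(start_url, domain, max_depth, max_workers=5):
    """Crawl one depth level at a time, fetching each level in parallel."""
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_depth):
            next_frontier = []
            # Results are merged in the main thread, so `seen` needs no lock.
            for links in pool.map(fetch_links, frontier):
                for link in links:
                    # Keep only unseen links on the configured domain.
                    if urlparse(link).netloc == domain and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            if not next_frontier:
                break
            frontier = next_frontier
    return sorted(seen)
```

Calling `crawl("https://www.example.com/", "www.example.com", max_depth=2)` visits the start page and everything reachable within two link hops on the same domain; the off-domain link is skipped.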

Features

  • Recursive crawling
  • Multithreading for improved performance
  • Optional filtering by page last modification time
  • Respects robots.txt policies
  • Thread-safe
  • Generates sitemap in standard XML format
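As an illustration of the modification-time filter, here is a minimal sketch. The README does not say how the script determines a page's modification time; this example assumes the HTTP `Last-Modified` response header is used and that pages without a usable header are kept. The function name `modified_after` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def modified_after(last_modified_header, threshold):
    """Return True if a page passes the time filter."""
    if not last_modified_header:
        # Assumption: pages without a Last-Modified header are kept.
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Assumption: pages with an unparseable date are kept.
        return True
    if modified.tzinfo is None:
        # Treat naive timestamps as UTC so the comparison is valid.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > threshold
```

The threshold must be timezone-aware, as in the documented default, because Python refuses to compare naive and aware datetime objects.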

Requirements

  • Python 3.8 or higher

Dependencies

  • beautifulsoup4
  • lxml

Installation

  1. Clone the repository to your local machine:
git clone https://github.com/cnkang/sitemap-generator.git
  2. Navigate to the repository directory:
cd sitemap-generator

Usage

Before running the script, set the following global variables at the top of the script to suit your site:

  • MAX_DEPTH: Maximum depth of links to traverse. Default is 10.
  • DOMAIN: Only links from this domain will be included in the sitemap. For example: 'www.example.com'.
  • MAX_WORKERS: Maximum number of threads used for parallel processing. Default is 5.
  • OUTPUT_FILENAME: The filename of the generated sitemap. Default is 'sitemap.xml'.
  • START_URL: The initial URL to start crawling from. For example: 'https://www.example.com/home'.
  • USE_TIME_FILTER: Whether to filter pages by modification time. Set to True or False. Default is True.
  • TIME_FILTER_THRESHOLD: Only include pages modified after this date. It is a datetime object. Default is datetime(2022, 9, 20, tzinfo=pytz.UTC).
  • RESPECT_ROBOTS_TXT: Whether to respect the website's robots.txt policies. Default is True.
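Put together, the configuration block might look something like the sketch below. It assumes the variables are plain module-level assignments, uses the defaults and example values listed above, and substitutes the standard library's `timezone.utc` for `pytz.UTC` so the example has no third-party dependency. The helper `start_url_matches_domain` is hypothetical and only illustrates a quick sanity check.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

MAX_DEPTH = 10                      # maximum link depth to traverse
DOMAIN = "www.example.com"          # only links on this domain are included
MAX_WORKERS = 5                     # number of crawler threads
OUTPUT_FILENAME = "sitemap.xml"     # where the sitemap is written
START_URL = "https://www.example.com/home"
USE_TIME_FILTER = True
# The documented default uses pytz.UTC; timezone.utc denotes the same zone.
TIME_FILTER_THRESHOLD = datetime(2022, 9, 20, tzinfo=timezone.utc)
RESPECT_ROBOTS_TXT = True


def start_url_matches_domain(start_url=START_URL, domain=DOMAIN):
    """Hypothetical check: the start URL should live on the configured domain."""
    return urlparse(start_url).netloc == domain
```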

Once you have configured these variables, you can run the script by executing it with Python:

python sitemap_generator.py

This will create a sitemap XML file with the name specified in OUTPUT_FILENAME.
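For reference, a sitemap in the standard format is a `urlset` element containing one `url`/`loc` entry per page. The sketch below shows one way to produce such a file. It uses the standard library's `xml.etree.ElementTree` instead of lxml to stay dependency-free, and the function names `build_sitemap_xml` and `write_sitemap` are hypothetical, not taken from the script.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_xml(urls):
    """Return the <urlset> document as a string, one <url><loc> per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def write_sitemap(urls, filename="sitemap.xml"):
    """Write the sitemap to disk with an XML declaration."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(build_sitemap_xml(urls))
```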

Note

  • Increase MAX_WORKERS with caution: more threads mean more aggressive crawling, which may be considered impolite or may violate a website's terms of service.
  • Make sure DOMAIN and START_URL are set correctly; otherwise the script will not generate a meaningful sitemap.
  • Keeping RESPECT_ROBOTS_TXT enabled is recommended, to avoid overloading the server or violating the website's crawling rules.
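To show what respecting robots.txt involves, here is a minimal sketch based on the standard library's `urllib.robotparser`. Whether the script uses this module is an assumption; `make_robot_checker` is a hypothetical helper, and the rules are passed in as text so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt_text, user_agent="*"):
    """Return a function that reports whether a URL may be crawled."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file that was already fetched.
    parser.parse(robots_txt_text.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Example rules: everything is allowed except paths under /private/.
rules = "User-agent: *\nDisallow: /private/\n"
allowed = make_robot_checker(rules)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the site's robots.txt instead of taking the text as an argument.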
