The Itsy Bitsy Spider

Features:

This web crawler lets you search for a string within websites crawled from across the web. The crawling module gathers information about each website, including its title, meta description and content, and stores it persistently in a database. Users can then search for a string through the search interface. The crawler also complies with each website's robots.txt file: only content from pages that the file allows to be crawled is saved in the database. Errors, such as a robots.txt file that cannot be found or parsed, or a cURL request that does not return valid HTML, are handled so that the crawler keeps running.
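
As a rough illustration of the robots.txt compliance described above, the sketch below checks a URL path against the Disallow rules of a robots.txt file. The function names `parseDisallowRules` and `isPathAllowed` are hypothetical and not taken from this project. The logic is deliberately simplified: it only reads groups whose user agent is `*`, uses plain prefix matching, and ignores Allow rules and wildcards.

```php
<?php
// Hypothetical helper names; a simplified sketch, not the project's actual code.

/**
 * Collect the Disallow path prefixes that apply to all user agents ("*").
 * Simplification: the most recent User-agent line decides whether a rule
 * applies; Allow rules, wildcards and multi-agent groups are ignored.
 */
function parseDisallowRules(string $robotsTxt): array
{
    $rules = [];
    $appliesToAll = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $appliesToAll = ($value === '*');
        } elseif ($field === 'disallow' && $appliesToAll && $value !== '') {
            // An empty Disallow value means "nothing is disallowed", so it is skipped.
            $rules[] = $value;
        }
    }
    return $rules;
}

/** Return true when no Disallow prefix matches the start of the path. */
function isPathAllowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

A crawler built this way would call `isPathAllowed` on the path of each candidate URL before fetching or storing the page, and could treat a missing robots.txt as an empty rule list.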

Technologies Used

Languages:

  • PHP: Used for server-side scripting to handle the backend logic, interact with the database, and generate dynamic content.
  • HTML: Utilized for creating the structure and content of the web pages, defining the user interface elements.
  • SQL: Employed for managing and querying the MySQL database, handling data storage and retrieval.

Libraries:

  • cURL (Client URL): Used for making HTTP requests to fetch the HTML content of web pages. It facilitates the web crawling process by retrieving information from external websites.
  • DOMDocument: Utilized for HTML parsing, allowing the extraction of specific elements and content from the fetched web pages. This is essential for analyzing and storing relevant data.
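
To make the roles of the two libraries concrete, here is a minimal sketch of fetching a page with cURL and extracting its title, meta description and links with DOMDocument. The helper names `fetchHtml` and `extractPageData`, the timeout value and the returned array shape are assumptions for illustration and may differ from what this project does.

```php
<?php
// Hypothetical helpers; names and options are assumptions, not the project's code.

/** Fetch a page with cURL; returns null on transport errors or non-200 responses. */
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    return ($body === false || $status !== 200) ? null : $body;
}

/** Extract the title, meta description and link targets from an HTML string. */
function extractPageData(string $html): array
{
    $result = ['title' => '', 'description' => '', 'links' => []];
    if (trim($html) === '') {
        return $result; // loadHTML() rejects empty input in PHP 8
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $result['description'] = $meta->getAttribute('content');
            break;
        }
    }
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $result['links'][] = $href;
        }
    }
    return $result;
}
```

In this sketch, a `null` result from `fetchHtml` signals that the page should be skipped, which is one way to keep a failed request from interrupting the crawl.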

Database:

  • MySQL: Chosen as the relational database management system to store crawled data. It provides a structured and efficient way to organize information, making it easily accessible for retrieval and analysis.
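
A string search over the stored data could look roughly like the sketch below. It assumes a PDO connection and a hypothetical table `pages` with `url`, `title`, `description` and `content` columns; the project's actual schema, column names and database API may differ.

```php
<?php
// Hypothetical sketch: the table name `pages` and its columns are assumptions,
// not the project's actual schema.

/** Build a LIKE pattern that matches $term anywhere, escaping LIKE wildcards. */
function likePattern(string $term): string
{
    // Escape backslash, % and _ so that user input is matched literally.
    return '%' . addcslashes($term, '\\%_') . '%';
}

/** Search stored pages using a prepared statement, which guards against SQL injection. */
function searchPages(PDO $pdo, string $term): array
{
    $stmt = $pdo->prepare(
        'SELECT url, title, description FROM pages
         WHERE title LIKE :t OR description LIKE :d OR content LIKE :c'
    );
    $pattern = likePattern($term);
    $stmt->execute([':t' => $pattern, ':d' => $pattern, ':c' => $pattern]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Binding the pattern as a parameter, rather than concatenating the search string into the SQL text, keeps user input out of the query itself.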

Styling:

  • Bootstrap: Implemented for styling and layout purposes, ensuring a consistent and visually appealing user interface. Bootstrap's responsive design elements enhance the application's accessibility across various devices and screen sizes.

Setting Up and Running the Spider

Prerequisites

  • PHP (v.8.3.0 or above)
  • XAMPP Server

  1. Download the zipped code folder or clone this repository:

    git clone https://github.com/aaminaa01/web-crawler.git
  2. Place the unzipped or cloned folder in the htdocs folder inside your XAMPP installation. For example, with XAMPP installed on drive D:, the resulting path is D:\xampp\htdocs\web-crawler-main. Note that a cloned folder is named web-crawler by default; rename it to web-crawler-main or adjust the URLs in the following steps accordingly.

  3. Run the XAMPP server.

  4. Open any browser. To set up the spider and crawl content from the seed URL, enter the following into the address bar:

    http://localhost/web-crawler-main/index.php
  5. To search for a string within the crawled content, enter the following into the address bar:

    http://localhost/web-crawler-main/home.html
  6. You can now search for strings in the crawled content.

By default, the crawl depth is limited to two levels below the seed URL and the maximum execution time is 1000 seconds. To change these values, modify the variables $depth_limit and $time_limit in index.php.
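
The effect of the two limits can be sketched with a small breadth-first crawl. The function name `crawlBreadthFirst` and the `$getLinks` callback are hypothetical, and the sketch assumes that the seed URL counts as depth 0; only `$depth_limit` and `$time_limit` are names taken from index.php, and the project's own loop may be structured differently.

```php
<?php
// A sketch only: crawlBreadthFirst and $getLinks are hypothetical names.
// $depth_limit and $time_limit mirror the variables mentioned for index.php.

/**
 * Visit pages breadth-first, starting at $seedUrl (depth 0).
 * $getLinks is a callable that returns the link targets found on a URL.
 * Returns a map of every discovered URL to the depth at which it was found.
 */
function crawlBreadthFirst(string $seedUrl, int $depth_limit, int $time_limit, callable $getLinks): array
{
    $deadline = microtime(true) + $time_limit;
    $visited = [$seedUrl => 0];
    $queue = [[$seedUrl, 0]];

    // Stop when the queue is empty or the time budget is used up.
    while ($queue && microtime(true) < $deadline) {
        [$url, $depth] = array_shift($queue);
        if ($depth >= $depth_limit) {
            continue; // pages at the depth limit are recorded but not expanded
        }
        foreach ($getLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = $depth + 1;
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return $visited;
}

// Usage with an in-memory link graph standing in for real pages.
$graph = ['a' => ['b', 'c'], 'b' => ['d'], 'd' => ['e']];
$found = crawlBreadthFirst('a', 2, 1000, fn(string $url): array => $graph[$url] ?? []);
```

With a depth limit of 2, the page `e` in this toy graph is never reached because it sits three links away from the seed.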

User Interface

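1. index.php

[Screenshot of index.php]

2. home.html

[Screenshot of the home.html start page]

3. Entering a Query in the Search Bar

[Screenshot of a query entered in the search bar]

4. Search Results

[Screenshot of the search results]

5. Database Schema

[Diagram of the database schema]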

Contributors

  • aaminaa01
