This web crawler lets you search for a string within websites crawled from across the web. The crawling module gathers information about websites, including their titles, meta descriptions, and content, and stores it persistently in a database. Users can then search for a given string through an intuitive search interface. The crawler complies with each website's robots.txt file: only content from pages that are allowed to be crawled is saved in the database. Errors, such as a robots.txt file that cannot be found or parsed, or a cURL request that does not return valid HTML content, are handled appropriately so that they do not disrupt the crawler's operation.
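As a rough sketch of the robots.txt compliance check described above (the function and variable names here are illustrative, not the project's actual code), a crawler can fetch a site's robots.txt and test the target path against the `Disallow` rules before saving any content:

```php
<?php
// Hypothetical sketch of a robots.txt compliance check; names are
// illustrative and do not correspond to the project's actual code.
function isCrawlAllowed(string $url, string $userAgent = '*'): bool
{
    $parts = parse_url($url);
    $robotsUrl = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';

    // If robots.txt cannot be retrieved, assume crawling is allowed.
    $robots = @file_get_contents($robotsUrl);
    if ($robots === false) {
        return true;
    }

    $path = $parts['path'] ?? '/';
    $applies = false;
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $agent = trim(substr($line, 11));
            $applies = ($agent === '*' || $agent === $userAgent);
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // A non-empty Disallow rule that prefixes the path blocks it.
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}
```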
- PHP: Used for server-side scripting to handle the backend logic, interact with the database, and generate dynamic content.
- HTML: Utilized for creating the structure and content of the web pages, defining the user interface elements.
- SQL: Employed for managing and querying the MySQL database, handling data storage and retrieval.
- cURL (Client URL): Used for making HTTP requests to fetch the HTML content of web pages. It facilitates the web crawling process by retrieving information from external websites.
- DOMDocument: Utilized for HTML parsing, allowing the extraction of specific elements and content from the fetched web pages. This is essential for analyzing and storing relevant data.
- MySQL: Chosen as the relational database management system to store crawled data. It provides a structured and efficient way to organize information, making it easily accessible for retrieval and analysis.
- Bootstrap: Implemented for styling and layout purposes, ensuring a consistent and visually appealing user interface. Bootstrap's responsive design elements enhance the application's accessibility across various devices and screen sizes.
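To illustrate how cURL and DOMDocument typically work together in a crawler like this one (a minimal sketch under assumed names, not the project's exact code), a page can be fetched over HTTP and its title and meta description extracted from the parsed DOM:

```php
<?php
// Illustrative sketch of fetching a page with cURL and extracting its
// title and meta description with DOMDocument; not the project's code.
function fetchPageInfo(string $url): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Treat transport errors and non-200 responses as failures.
    if ($html === false || $status !== 200) {
        return null;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }
    return ['title' => $title, 'description' => $description];
}
```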
- PHP (v.8.3.0 or above)
- XAMPP Server
- Download the zipped code folder or clone this repository:

  git clone https://github.com/aaminaa01/web-crawler.git
- Put the downloaded or cloned (unzipped) folder in the htdocs folder inside the XAMPP installation directory. For example, if XAMPP is installed in D:\, the resulting path will be D:\xampp\htdocs\web-crawler-main.
- Start the XAMPP server (Apache and MySQL).
- Open any browser and, to set up the spider and crawl content from the seed URL, enter the following into the address bar:

  http://localhost/web-crawler-main/index.php
- To search for a string within the crawled content, enter the following into the address bar:

  http://localhost/web-crawler-main/home.html
- You can now search for strings in the crawled content.
Please note that the crawl depth is currently set to two levels below the seed URL and the maximum execution time to 1000 seconds. To change these values, modify the $time_limit and $depth_limit variables in index.php.
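For reference, the two tunables mentioned above look roughly like this in index.php (the values shown are the defaults the README states; the exact surrounding code may differ):

```php
<?php
// Crawl configuration in index.php (defaults per the README).
$time_limit  = 1000; // maximum script execution time, in seconds
$depth_limit = 2;    // how many link levels below the seed URL to crawl

// Raise PHP's execution-time cap so a long crawl is not cut short.
set_time_limit($time_limit);
```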