This is a Python script that downloads data related to travel in Taiwan. The script scrapes text data from various Korean websites and downloads the magazine PDFs published on the TVA website.
This script uses Python 3. Before running this script, please make sure you have the following libraries installed:
- requests
- beautifulsoup4
- PyPDF4
- pdfminer
- pdfplumber
If you have not installed these libraries, install them with the following command (the leading ! is only needed inside a Jupyter or Colab notebook cell; omit it in a regular shell):
!pip install requests beautifulsoup4 PyPDF4 pdfminer pdfplumber
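Before running the script, it can help to confirm that the libraries are actually importable. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the import names are assumed from the packages listed above (beautifulsoup4 is imported as `bs4`).

```python
import importlib.util

# Import names assumed for the packages listed above.
# Note that beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "PyPDF4", "pdfminer", "pdfplumber"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```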
- Clone or download the repository to your computer.
- Open main.py with any Python IDE.
- Run the script.
- Magazine PDF will be downloaded in the pdf directory.
- Magazine txt will be downloaded in two versions:
  - txt_pages: Separated by pages (ex: v01-p1.txt)
  - txt_volumes: Separated by volumes (ex: v01.txt)
- thema.txt: Theme Travel from 테마여행.
- pro.txt: Professional travel from 프로대만족.
- place.txt: Taiwan's attractions from 대만 명소.
Quarterly magazine PDFs
- Vol. 1~47 quarterly magazine PDF (대만관광격월간) from TVA website.
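The naming scheme for the text output can be sketched as follows. This is an illustration under assumptions, not the code in main.py: the helper names `page_txt_path` and `volume_txt_path` are hypothetical, and the two-digit zero padding of the volume number is inferred from the examples v01-p1.txt and v01.txt.

```python
from pathlib import Path

# 47 is the volume count stated in this README.
TOTAL_VOLUMES = 47


def page_txt_path(volume, page, base="txt_pages"):
    """Path of the per-page text file, e.g. txt_pages/v01-p1.txt."""
    # Assumption: volume numbers are zero-padded to two digits, pages are not.
    return Path(base) / f"v{volume:02d}-p{page}.txt"


def volume_txt_path(volume, base="txt_volumes"):
    """Path of the per-volume text file, e.g. txt_volumes/v01.txt."""
    return Path(base) / f"v{volume:02d}.txt"


if __name__ == "__main__":
    # List the per-volume file names for every volume.
    for volume in range(1, TOTAL_VOLUMES + 1):
        print(volume_txt_path(volume))
```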
- Currently, there are 20 pages in Professional travel (프로대만족) and 8 pages in Taiwan's attractions (대만 명소). If new articles are added to those pages, adjust the numbers in the get_pro_page and get_place_page functions in taiwanTour.py.
- Currently, there are 47 volumes of magazines. If a new volume is published, adjust the number in the getMagazine function in main.py.
- The usefulness of each magazine page depends on the following conditions:
  - If the number of characters on a page exceeds a threshold, the page is judged to be useful content. (Based on experiments, a threshold of 140 characters gives better quality.)
  - If the content includes keywords such as "도표" (chart), "통계" (statistics), "Content", or "fax", the page is skipped. (This avoids extracting tables, contact details, and tables of contents.)
- The V16 magazine is in image format, so the program fails to extract its content.
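The two usefulness conditions above can be sketched as a single filter function. This is a minimal illustration, not the repository's implementation: the name `is_useful_page` is hypothetical, keyword matching is assumed to be a case-sensitive substring check, and the keyword tuple contains only the four examples named above.

```python
# Threshold and keywords as described in the notes above.
MIN_CHARS = 140
SKIP_KEYWORDS = ("도표", "통계", "Content", "fax")


def is_useful_page(text, min_chars=MIN_CHARS, skip_keywords=SKIP_KEYWORDS):
    """Return True if a page's text passes both usefulness conditions."""
    stripped = text.strip()
    # Condition 1: the page must contain more than min_chars characters.
    if len(stripped) <= min_chars:
        return False
    # Condition 2: skip pages that look like tables, contact pages, or
    # tables of contents (assumed: case-sensitive substring match).
    return not any(keyword in stripped for keyword in skip_keywords)
```

A page for which text extraction yields nothing, as with the image-based V16 magazine, produces an empty string and is rejected by the first condition.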