Scientific Journal Web Scraper

A web crawler that scrapes Indonesian scientific journal and article data from SINTA - Science and Technology Index and GARUDA - Garda Rujukan Digital, built with Python 3 and BeautifulSoup.

How To Run

$ python scrape_web

Features

Traverses the tables in SINTA - Science and Technology Index and GARUDA - Garda Rujukan Digital sequentially.
Scrapes journal data from SINTA - Science and Technology Index.
Scrapes article abstract data from GARUDA - Garda Rujukan Digital.
Scrapes only Indonesian abstracts by detecting the language of the abstract.

How It Works

A single agent traverses the table at http://sinta.ristekbrin.go.id/journals and checks each row for an <img> tag with the class stat-garuda-small. If the tag exists, the agent goes one level deeper by following the URL in the href attribute of the <a> tag in that row. It then traverses the table at that URL, scraping the text from each <xmp> tag with the class abstract-article. To reach the following pages, the script appends "?page=2" to the URL and keeps incrementing the page number. Only after the pages run out does the agent leave the nested traversal and resume the main traversal.
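
The two-level traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies, the names GarudaRowParser and iter_abstract_pages are hypothetical, the GARUDA link is assumed to be the first <a> in the row, and an empty page is assumed to mean the pages have run out.

```python
from html.parser import HTMLParser
from typing import Callable, Iterator, List, Optional


class GarudaRowParser(HTMLParser):
    """Collect one link per table row that carries the GARUDA badge.

    Assumption: the link to follow is the first <a href> in the row.
    """

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self._href: Optional[str] = None
        self._has_badge = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "tr":
            # A new row starts: forget what was seen in the previous one.
            self._href, self._has_badge = None, False
        elif tag == "a" and self._href is None:
            self._href = attributes.get("href")
        elif tag == "img":
            classes = (attributes.get("class") or "").split()
            if "stat-garuda-small" in classes:
                self._has_badge = True

    def handle_endtag(self, tag):
        # Keep the row's link only if the badge was found in that row.
        if tag == "tr" and self._has_badge and self._href:
            self.links.append(self._href)


def iter_abstract_pages(
    journal_url: str,
    fetch_abstracts: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield abstracts page by page until a page comes back empty.

    fetch_abstracts is a hypothetical callable that downloads one page
    and returns the text of its <xmp class="abstract-article"> tags.
    """
    page = 1
    while True:
        url = journal_url if page == 1 else f"{journal_url}?page={page}"
        abstracts = fetch_abstracts(url)
        if not abstracts:
            break  # assumed signal that the pages have run out
        yield from abstracts
        page += 1
```

In this sketch the download step is passed in as a callable, so the paging logic can be exercised without touching the network.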

Because this program targets Indonesian scientific journal and article data, the langdetect library is used to make sure the scraped text is Indonesian. The check takes the first two sentences of the paragraph and detects the language of each. If either of the two sentences is not Indonesian, the paragraph is not scraped.
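
A minimal sketch of that check is shown below. The function names and the sentence-splitting regular expression are assumptions made for illustration; the project may split sentences differently. The detector is passed in as a parameter and falls back to langdetect's detect function, which returns the language code "id" for Indonesian.

```python
import re
from typing import Callable, List, Optional


def first_two_sentences(paragraph: str) -> List[str]:
    """Split on sentence-ending punctuation and keep at most two sentences."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part][:2]


def is_indonesian(
    paragraph: str,
    detect_language: Optional[Callable[[str], str]] = None,
) -> bool:
    """Return True only if every checked sentence is detected as Indonesian."""
    if detect_language is None:
        # Third-party dependency named in this README (pip install langdetect).
        from langdetect import detect as detect_language

    sentences = first_two_sentences(paragraph)
    if not sentences:
        return False  # nothing to check, so nothing to scrape
    return all(detect_language(sentence) == "id" for sentence in sentences)
```

Note that a paragraph with only one sentence is judged on that sentence alone in this sketch; how the original script treats that case is not stated here.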

Data Gathered

Newly scraped data is saved to ./output/output.csv with the CSV header JOURNAL_TITLE, ARTICLE_TITLE, and ARTICLE_ABSTRACT. The data was last scraped on April 1st, 2020. The data scraped so far amounts to 157,687 rows from 2,527 journals and is aggregated in the ./data/master/ directory.
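
For illustration, rows with that header could be written using Python's csv module as sketched below. The helper name append_rows is hypothetical, and appending to an existing file (rather than overwriting it) is an assumption, not something this README confirms.

```python
import csv
import os
from typing import Iterable, Sequence

HEADER = ["JOURNAL_TITLE", "ARTICLE_TITLE", "ARTICLE_ABSTRACT"]


def append_rows(path: str, rows: Iterable[Sequence[str]]) -> None:
    """Append rows to a CSV file, writing the header only for a new file."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Assumption: a missing or empty file still needs its header row.
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0

    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(HEADER)
        writer.writerows(rows)
```

A call such as append_rows("./output/output.csv", [(journal, title, abstract)]) would then add one row; the csv module quotes abstracts that contain commas or line breaks.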
