A simple Node application that parses articles from webpages according to a configuration file and sends an email notification about new articles.
My son goes to kindergarten. The kindergarten is very active on its website and regularly posts information for parents. I created this script so that I don't miss any news.
The idea was to keep it as simple as possible, without too many features. I didn't want to use a database, so the data about discovered articles is stored in a file.
In the news and articles, I am interested in the TITLE, the initial TEXT, and the HREF of the detailed article. Example page structure:
<div id='content'>
  <div class='row justify-content-center dlazdice'>
    <div class='nadpis'>AKTUALITY</div>
    <div class='col-md-3 text-box-obsah'>
      <h2>Article title 01</h2>
      <p>Article text 01</p>
      <p>
        <a class='btn btn-secondary tlacitko' href='index.php?p=some-page-01' role='button'>Show more...</a>
      </p>
    </div>
    <div class='col-md-3 text-box-obsah'>
      <h2>Article title 02</h2>
      <p>Article text 02</p>
      <p>
        <a class='btn btn-secondary tlacitko' href='index.php?p=some-page-02' role='button'>Show more...</a>
      </p>
    </div>
  </div>
</div>
Requires a config.json file in the project root:
{
  "parseUrl": [
    {
      "name": "my-url-name-1",
      "link": "https://example-domain.com"
    },
    {
      "name": "my-url-name-2",
      "link": "https://example-domain.com/articles/"
    }
  ],
  "parseUrlSelectors": {
    "main": "div#content div.text-box-obsah",
    "articleTitle": "h2",
    "articleText": "p",
    "articleDetailHref": "href"
  },
  "mailer": {
    "host": "smtp.seznam.cz",
    "port": "465",
    "secure": true,
    "auth": {
      "user": "[email protected]",
      "pass": "MySuperSecretPassword"
    },
    "mailOptions": {
      "to": "[email protected], [email protected]",
      "subject": "Subject for email"
    },
    "admin": "[email protected]"
  }
}
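A typo in config.json is easy to miss until the script runs unattended, so it can help to check the file at startup. The following is a minimal sketch, not the app's actual code: the helper names `findMissingKeys` and `loadConfig` are hypothetical, and the list of required keys is only an assumption based on the sample above (`secure` and `admin` are treated as optional here).

```javascript
// Hypothetical startup check; not part of the original app.
const fs = require('fs');

// Dot-separated key paths taken from the sample config.json (assumed to be required).
const REQUIRED_KEYS = [
  'parseUrl',
  'parseUrlSelectors.main',
  'parseUrlSelectors.articleTitle',
  'parseUrlSelectors.articleText',
  'parseUrlSelectors.articleDetailHref',
  'mailer.host',
  'mailer.port',
  'mailer.auth.user',
  'mailer.auth.pass',
  'mailer.mailOptions.to',
  'mailer.mailOptions.subject',
];

// Returns the required key paths that are missing (or empty) in the config object.
function findMissingKeys(config) {
  return REQUIRED_KEYS.filter((path) => {
    const value = path.split('.').reduce(
      (node, key) => (node !== null && typeof node === 'object' ? node[key] : undefined),
      config
    );
    return value === undefined || value === null || value === '';
  });
}

// Reads and parses the config file, failing early with a readable message.
function loadConfig(file = './config.json') {
  const config = JSON.parse(fs.readFileSync(file, 'utf8'));
  const missing = findMissingKeys(config);
  if (missing.length > 0) {
    throw new Error(`config.json is missing: ${missing.join(', ')}`);
  }
  return config;
}
```

Calling `loadConfig()` once at the top of the script would turn a misspelled key into an immediate error instead of a silent failure later in the run.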
- Fork this git repo
- npm install
- npm run build
  - This generates (using webpack) a bundled version in ./dist/app.bundle.js
- Copy/deploy app.bundle.js to your hosting with a running Node environment
- npm install --production
- Adjust config.json based on your needs and copy it to the same location as app.bundle.js
- node app.bundle.js
- Schedule the script (cron job) to run every hour or so
- By default, the app logs info messages to ./info.log and error messages to ./error.log
- Articles are stored in the file ./storage.json
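Because the script runs repeatedly from cron, each run has to compare what it just parsed with what it has already seen. The following is a minimal sketch of that idea, not the app's actual code. It assumes that storage.json holds a JSON array of `{ title, text, href }` objects and that `href` identifies an article; the function names are hypothetical.

```javascript
// Hypothetical file-based "seen articles" store; the storage format is an assumption.
const fs = require('fs');

// Reads previously seen articles; a missing file means nothing has been seen yet.
function loadStored(file) {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Returns the parsed articles whose href is not yet stored (duplicates are skipped).
function findNewArticles(stored, parsed) {
  const seen = new Set(stored.map((article) => article.href));
  return parsed.filter((article) => {
    if (seen.has(article.href)) return false;
    seen.add(article.href);
    return true;
  });
}

// Appends new articles to the storage file and returns them for notification.
function updateStorage(file, parsed) {
  const stored = loadStored(file);
  const fresh = findNewArticles(stored, parsed);
  if (fresh.length > 0) {
    fs.writeFileSync(file, JSON.stringify(stored.concat(fresh), null, 2));
  }
  return fresh;
}
```

With this shape, a run that finds nothing new leaves the file untouched and returns an empty array, so no email needs to be sent.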
- puppeteer (API to control Chrome or Chromium)
- cheerio (Fast, flexible & lean implementation of core jQuery designed specifically for the server.)
- nodemailer (Send e-mails)
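The notification step can be sketched as follows. This is only an illustration under stated assumptions, not the app's actual code: `buildMessage` is a hypothetical helper, using `mailer.auth.user` as the sender is an assumption, and resolving the relative `href` against the page link is an added convenience so that links in the email are clickable. The returned object uses the standard nodemailer message fields (`from`, `to`, `subject`, `text`).

```javascript
// Hypothetical helper: builds a nodemailer-style message from newly found articles.
function buildMessage(mailer, pageLink, articles) {
  const blocks = articles.map((article) => {
    // Turn a relative link such as 'index.php?p=some-page-01' into an absolute URL.
    const url = new URL(article.href, pageLink).href;
    return `${article.title}\n${article.text}\n${url}`;
  });
  return {
    from: mailer.auth.user, // assumption: send from the authenticated account
    to: mailer.mailOptions.to,
    subject: mailer.mailOptions.subject,
    text: blocks.join('\n\n'),
  };
}
```

The result could then be handed to a nodemailer transport created from the `mailer` section of config.json, for example `transporter.sendMail(buildMessage(config.mailer, link, newArticles))`.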