jakopako / goskyr

A configurable command-line web scraper written in Go with auto-configuration capability.
License: GNU General Public License v3.0
This was already fixed, but the fix had to be removed during a refactoring. The problem is smart-guessing the year of a date when the crawled website does not provide it; the Helsinki crawler is an example. Within a single year this does not really matter, but around the turn of the year the dates would be wrong: the year would always be the 'old' one instead of being incremented for dates that fall into the new year.
If there are two filters with match: true, we'd expect an item to be included in the results if at least one of those filters matches. Currently, both have to match for the item to be included.
Currently, those two config keys logically do very similar things. So it makes sense to merge their functionality into one config snippet.
Add comments to this example config file to explain the config options in detail.
It would be good to 'purify' this repo by providing only a generic crawler that extracts any structured data from websites. But since GitHub Actions are very handy for executing a regular crawl with a specific configuration, there could be a separate branch just for this purpose.
Add info about which crawler logs which log string.
In some cases it might be necessary to filter out certain events, such as in the case of Backstage, where we don't want the CORONA TESTZENTRUM events in our list.
Therefore, add a regex-based filter for text fields (currently title & comment).
Example: location 'Strom'. There, the description can be found on the event-specific subpage but does not always have the same selector. Other locations probably have the same issue. Just look at the config.yml and check locations that don't have the comment field defined.
I think it would be better to have separate field filters per crawler, like so:

```yaml
filters:
  - field: "title"
    regex: "some regex"
  - field: "comment"
    regex: "some other regex"
```

This would make the implementation easier since we could simply loop over the array of filter items and apply each filter to the corresponding field. On the other hand, we would still need something like a switch case to map the field string to the corresponding field in the crawler struct... Think about this and improve the implementation.
Make field names customizable. Add a field type that defines what parameters are needed to extract this field. Currently, there would be a text type, a url type and a date type.
Add a way to cope with different formats. Issue #35 possibly solves this as well.
Use fmt.Errorf instead of errMsg := fmt.Sprintf and then errors.New(errMsg)
It is better to separate the code development from the use case 'concert scraping'. That way we would have a repository solely for managing which concert websites get crawled and could also accept pull requests for this.
Currently, only existing fields can be used for filtering. Sometimes, however, it is necessary to filter on other text elements that are not used for any fields. One example would be the location 'LaCigale'.
The respective POST request in the GitHub Action results in a 400 status code.
Not all crawlers have filters to remove canceled or postponed concerts. Add filters where necessary.