jakopako / goskyr

A configurable command-line web scraper written in Go with auto-configuration capability.
License: GNU General Public License v3.0
This was already fixed, but the fix had to be removed during a refactoring. The problem is smart-guessing the year of a date when the crawled website does not provide it; the Helsinki crawler is an example. Within a single year this does not really matter, but around the turn of the year the dates would be wrong: the year would always be the 'old' one instead of being incremented for dates that fall into the new year.
If there are two filters with match: true, we'd expect an item to be included in the results if at least one of those filters matches. Currently, both have to match for the item to be included.
Currently, those two config keys logically do very similar things. So it makes sense to merge their functionality into one config snippet.
Add comments to this example config file to explain the config options in detail.
It would be good to 'purify' this repo by providing only a generic crawler that extracts any structured data from websites. But since GitHub Actions are very handy for executing a regular crawl with a specific configuration, there could be a separate branch just for this purpose.
Add info about which crawler logs which log string.
In some cases it might be necessary to filter out certain events, such as in the case of Backstage, where we don't want the CORONA TESTZENTRUM events in our list.
Therefore, add a regex-based filter for text fields (currently title & comment).
Example: location 'Strom'. There, the description can be found on the event-specific subpage but does not always have the same selector. Other locations probably have the same issue. Just look at the config.yml and check locations that don't have the comment field defined.
I think it would be better to have separate field filters per crawler, like so:

```yaml
filters:
  - field: "title"
    regex: "some regex"
  - field: "comment"
    regex: "some other regex"
```

This would make the implementation easier since we could simply loop over the array of filter items and apply each filter to the corresponding field. On the other hand, we would still need something like a switch case to map the field string to the corresponding field in the crawler struct... Think about this and improve the implementation.
Make field names customizable. Add a field type that defines what parameters are needed to extract this field. Currently, there would be a text type, a url type and a date type.
Add a way to cope with different formats. Issue #35 possibly solves this as well.
Use fmt.Errorf instead of errMsg := fmt.Sprintf and then errors.New(errMsg)
It is better to separate the code development from the use case 'concert scraping'. That way we would have a repository solely for managing which concert websites get crawled and could also accept pull requests for this.
Currently, only existing fields can be used for filtering. Sometimes, however, it is necessary to filter on other text elements that are not used for any fields. One example would be the location 'LaCigale'.
The respective POST request in the GitHub Action results in a 400 status code.
Not all crawlers have filters to remove canceled or postponed concerts. Add filters where necessary.