redco / goose-parser
Universal scraping tool, which allows you to extract data using multiple environments
Home Page: https://andrew.red/posts/goose-parser-the-beginning
License: MIT License
A project with a configured build for the browser env and a working basic usage example. Users will be able to clone this repo and start hacking!
For example we need to do two actions:
Add the ability to define a custom pagination method
Sometimes we need to specify the data type in the parsed result.
For now this is the "type" field. But we also use the "type" keyword to determine the type of actions and transforms. So we need to rename "type" to "dataType" in the cases where it means a data type.
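The rename could be applied mechanically to existing rule definitions. A minimal sketch, assuming rules are plain objects and that a "type" whose value is in a known set of data type names carries the data-type meaning. The `DATA_TYPES` set, the `migrateRule` helper and the sample rule shape are all hypothetical and not part of goose-parser:

```javascript
// Hypothetical set of values that mean "data type" rather than action/transform kind.
const DATA_TYPES = new Set(['string', 'number', 'boolean', 'array']);

// Recursively copy a rule, renaming `type` to `dataType` where it holds a data type.
function migrateRule(node) {
  if (Array.isArray(node)) {
    return node.map(migrateRule);
  }
  if (node === null || typeof node !== 'object') {
    return node;
  }
  const result = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === 'type' && DATA_TYPES.has(value)) {
      result.dataType = value; // data-type meaning: rename the key
    } else {
      result[key] = migrateRule(value); // anything else is copied as-is
    }
  }
  return result;
}

// Illustrative rule only; the field names are assumptions.
const sampleRule = {
  name: 'price',
  scope: '.price',
  type: 'number',
  transform: [{ type: 'trim' }],
};
console.log(migrateRule(sampleRule));
```

Here the top-level "type" becomes "dataType", while the "type" inside the transform is left alone because 'trim' is not in the assumed data type set.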
Now that we have plenty of docs, it's time to add a table of contents at the top of Readme.md
Add docker files to use goose-parser from envs:
Sometimes we need to click or perform some custom actions before parsing a row from a grid.
Example to specify event:
{event: 'parse.pre', scope: 'div.expl-open-ticket-button', action: 'click'}
Add a Selenium environment to simplify the testing process
We have stores now, so data can be set in one action and read inside another.
useActionsResult, which is used to get the results of previous actions, is a leftover and should be removed.
Ability to go deeper, into a row's information page, and parse the row there.
For example, we have a list of rows, and each of them has an information page where the row should be parsed.
Needed:
Example:
We have parsed data:
We need to get:
Return new parsing scope from action
It should:
Implement:
Add documentation
Ability to parse scope attributes. For example row deep link.
Add and configure coveralls.io
Prepare basic documentation about
For now, on PhantomError we get only the error, but it can happen after several rows have already been parsed.
Ability to parse or set a row identifier depending on the row scope.
This identifier tells us that the row is unique among the others.
Add tests
Needed:
When page-load pagination happens, the application gets errors about Sizzle (because a promise tries to check data with it).
For now, on PhantomError we get only the error, but it can happen after several rows have already been parsed.
We need to provide the results together with the error.
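One way to return results together with the error is to wrap the original error in one that carries the rows collected so far. A minimal sketch in plain JavaScript; `PartialResultError` and `parseRows` are hypothetical names for illustration, not goose-parser API:

```javascript
// Hypothetical error type that carries the rows parsed before the failure.
class PartialResultError extends Error {
  constructor(cause, results) {
    super(`Parsing stopped after ${results.length} row(s): ${cause.message}`);
    this.name = 'PartialResultError';
    this.cause = cause;     // the original error
    this.results = results; // rows parsed successfully before it
  }
}

// Parse rows one by one; on failure, rethrow with everything collected so far.
function parseRows(rows, parseRow) {
  const results = [];
  for (const row of rows) {
    try {
      results.push(parseRow(row));
    } catch (err) {
      throw new PartialResultError(err, results);
    }
  }
  return results;
}

// Usage: the third row fails, the first two are still available to the caller.
try {
  parseRows(['1', '2', 'x'], (row) => {
    const n = Number(row);
    if (Number.isNaN(n)) throw new Error(`bad row "${row}"`);
    return n;
  });
} catch (err) {
  console.log(err.results, err.message);
}
```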
Before parsing, we need to check whether the row needs to be parsed at all.
For example, given a list of previously parsed ids, if the current row's _id is in the list, continue without parsing.
!! We need to parse _id
first, before parsing everything else.
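The check described above can be sketched as follows. This is only an illustration in plain JavaScript: `parseId` and `parseRest` are hypothetical callbacks standing in for the real row parsing, and the rows are plain objects rather than DOM scopes:

```javascript
// Parse only rows whose _id has not been seen before.
function parseNewRows(rows, knownIds, parseId, parseRest) {
  const seen = new Set(knownIds);
  const results = [];
  for (const row of rows) {
    const _id = parseId(row); // parse _id first, before anything else
    if (seen.has(_id)) {
      continue; // already parsed earlier: skip the rest of the row
    }
    seen.add(_id);
    results.push({ _id, ...parseRest(row) });
  }
  return results;
}

// Usage with in-memory rows; row 'a' was parsed in a previous run.
const sampleRows = [
  { id: 'a', title: 'first' },
  { id: 'b', title: 'second' },
];
console.log(
  parseNewRows(sampleRows, ['a'], (row) => row.id, (row) => ({ title: row.title }))
);
```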
We should be able to extend rules and actions from any one already defined in the system, and override particular properties.
This will make it easy to maintain a large number of similar rules.
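A minimal sketch of how such inheritance might be resolved, assuming rules live in a named registry and a hypothetical `extends` key names the parent. The key name, the registry and the sample properties are assumptions for illustration, not the library's actual format:

```javascript
// Hypothetical registry of named rules; `extends` names the parent rule.
const ruleRegistry = {
  baseLink: { scope: 'a', attr: 'href', separator: ',' },
  rowLink: { extends: 'baseLink', scope: 'a.row-link' },
};

// Resolve a rule by merging it over its (recursively resolved) parent.
function resolveRule(name, registry, chain = []) {
  if (chain.includes(name)) {
    throw new Error(`Circular extends: ${[...chain, name].join(' -> ')}`);
  }
  const rule = registry[name];
  if (!rule) {
    throw new Error(`Unknown rule "${name}"`);
  }
  const { extends: parentName, ...own } = rule;
  if (!parentName) {
    return own;
  }
  const parent = resolveRule(parentName, registry, [...chain, name]);
  return { ...parent, ...own }; // own properties override inherited ones
}

console.log(resolveRule('rowLink', ruleRegistry));
```

The merge is shallow on purpose: a child that redefines a nested property replaces it entirely, which keeps overriding predictable. A deep merge is a possible alternative if partial overrides of nested objects are needed.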
PhantomEnv should allow setting a list of proxies
and know when to switch between them.
Remember which proxies were used, and the last time and URL each one was used for.
Probably also set a switching strategy:
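Round-robin is one possible strategy. A minimal sketch, independent of PhantomEnv, that rotates through the list and remembers the last time and URL each proxy was used for; the `ProxyRotator` class and the proxy addresses are hypothetical:

```javascript
// Round-robin proxy rotation that records when and for which URL each proxy was last used.
class ProxyRotator {
  constructor(proxies) {
    if (proxies.length === 0) {
      throw new Error('At least one proxy is required');
    }
    this.proxies = proxies;
    this.index = -1;
    this.usage = new Map(); // proxy -> { url, usedAt }
  }

  // Pick the next proxy for the given URL and record the usage.
  next(url, now = Date.now()) {
    this.index = (this.index + 1) % this.proxies.length;
    const proxy = this.proxies[this.index];
    this.usage.set(proxy, { url, usedAt: now });
    return proxy;
  }

  // Last recorded usage of a proxy, or null if it has not been used yet.
  lastUsage(proxy) {
    return this.usage.get(proxy) || null;
  }
}

// Usage with placeholder proxy addresses.
const rotator = new ProxyRotator(['10.0.0.1:3128', '10.0.0.2:3128']);
console.log(rotator.next('http://example.com/page/1'));
console.log(rotator.next('http://example.com/page/2'));
console.log(rotator.lastUsage('10.0.0.1:3128'));
```

Other strategies (switch only after an error, or prefer the least recently used proxy) could replace the body of `next` while keeping the same usage bookkeeping.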
Add ability to paginate via simple pages
Ability to parse the URL in a simple rule
Add to docs:
Needed:
Needed:
Example:
We have parsed data:
We need to get:
Add SlimerEnvironment
https://slimerjs.org
Add excludes for resources loaded on the page: attach to the event that allows cancelling the load.
This kind is very close to scroll pagination, but instead of scrolling you need to click on a block to load a new page [extending the current page list].
We need to think of a way to build a custom pagination event that covers any pagination case.
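One way to cover arbitrary pagination is to leave the "go to the next page" step to a user-supplied callback. A minimal sketch of that idea; `paginate`, `parsePage` and `goToNextPage` are hypothetical names, and the pages are simulated with an in-memory array instead of a browser:

```javascript
// Generic pagination loop: parse the current page, then ask a user-supplied
// callback to advance. The callback resolves to false when there are no more pages.
async function paginate({ parsePage, goToNextPage, maxPages = Infinity }) {
  const results = [];
  let page = 0;
  do {
    results.push(...(await parsePage(page)));
    page += 1;
  } while (page < maxPages && (await goToNextPage(page)));
  return results;
}

// Usage: simulated pages; a real callback would scroll, click a "load more"
// block, or follow a link, and report whether new content appeared.
const fakePages = [['a', 'b'], ['c'], ['d']];
paginate({
  parsePage: async (i) => fakePages[i],
  goToNextPage: async (i) => i < fakePages.length,
}).then((rows) => console.log(rows));
```

Scroll, click and page-link pagination then differ only in what `goToNextPage` does.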
import {
  PhantomEnvironment,
  Parser
} from 'goose-parser';
const env = new PhantomEnvironment({
  url: 'http://www.gooseplanet.ru/'
});
TypeError: _gooseParser.PhantomEnvironment is not a constructor
I looked at the imported entities and both of them are undefined.
If I write:
import Parser from 'goose-parser'
It returns [Function: Parser]
But where can I find PhantomEnvironment?
Add the ability to execute custom actions before parsing starts.
For example we need to
This functionality will completely replace actions with the once flag.
Add ability to paginate via ajax pages
Move the current test system to the new, more efficient approach (take a look at the new tests in #76)
So, we have:
old tests here: tests/phantom_parser_test.js
new tests here: tests/phantom/lib/
To run the tests, just call npm test
When you move a test, remove it from tests/phantom_parser_test.js
At the end, remove the file tests/phantom_parser_test.js and the HTML page for it.
Add to docs: