
node-crawler's People

Contributors

markusthielker, mertend


Forkers

odnodn

node-crawler's Issues

Introducing Text Transformation Node with Advanced Manipulation and Regex Capabilities

There should be a Text Transformation Node. This node aims to empower users with the ability to manipulate and modify text content efficiently. Core functionalities would include extracting specific patterns via regular expressions (regex), appending or prepending content, replacing segments of text through plain text or regex patterns, transforming text case (e.g., uppercase, lowercase, title case), and trimming whitespace. Such a feature would be invaluable for tasks like data cleaning and content creation.
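A minimal sketch of what the node's core transformation could look like; the `TextTransformOptions` shape and the `applyTransform` name are hypothetical, not part of the project:

```ts
// Hypothetical option shape covering the transformations listed above.
type TextTransformOptions = {
  extractPattern?: RegExp; // keep only the first regex match
  replace?: { pattern: string | RegExp; replacement: string };
  casing?: "upper" | "lower" | "title";
  trim?: boolean;
  prepend?: string;
  append?: string;
};

function applyTransform(input: string, opts: TextTransformOptions): string {
  let text = input;
  if (opts.extractPattern) text = text.match(opts.extractPattern)?.[0] ?? "";
  if (opts.replace) text = text.replace(opts.replace.pattern, opts.replace.replacement);
  if (opts.casing === "upper") text = text.toUpperCase();
  if (opts.casing === "lower") text = text.toLowerCase();
  if (opts.casing === "title") text = text.replace(/\b\w/g, (c) => c.toUpperCase());
  if (opts.trim) text = text.trim();
  return `${opts.prepend ?? ""}${text}${opts.append ?? ""}`;
}
```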

Implement Step-by-Step Execution Feature (Debug Tool)

Description

Develop a step-by-step execution feature for the web crawler engine. This feature should allow users to manually control the execution of the web crawler, proceeding one step at a time. This will be similar to a debug tool in an Integrated Development Environment (IDE), allowing users to better understand the execution process and identify any issues or inefficiencies.

Tasks

  • Design and implement a user interface for the step-by-step execution feature.
  • Develop backend functionality to pause and resume execution at each node.
  • Implement a system for displaying the current state of the web crawler at each step, including the data being processed.
  • Ensure that the feature works seamlessly with the existing execution engine and user interface.
  • Test the feature thoroughly to ensure reliability and usability.
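A minimal sketch of the pause/resume mechanism, assuming the engine processes nodes in a loop and can await a controller between nodes (all names here are illustrative):

```ts
// The engine awaits waitForStep() before running each node; in step mode this
// blocks until the UI's "next step" button calls step().
class StepController {
  private resolveStep: (() => void) | null = null;
  stepMode = false;

  waitForStep(): Promise<void> {
    if (!this.stepMode) return Promise.resolve();
    return new Promise((resolve) => (this.resolveStep = resolve));
  }

  step(): void {
    this.resolveStep?.(); // release the engine for exactly one node
    this.resolveStep = null;
  }
}
```

The engine loop would then call `await controller.waitForStep()` before each node runs, which is also the natural point to publish the node's current state and data to the UI.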

Feedback on Edge Selection

Description

Provide visual or other forms of feedback when an edge is selected within the workflow to enhance user interaction.

Undo and Redo

Description

Develop an undo and redo feature for the web crawler editor. This feature should allow users to easily revert their actions or reapply them, providing a safety net when building and editing web crawlers. This will enhance the user experience by allowing users to experiment with different configurations without fear of making irreversible changes.

Tasks

  • Design and implement a user interface for the undo and redo feature, ensuring it is intuitive and accessible.
  • Implement a system for managing the state of the web crawler editor that supports undoing and redoing actions.
  • Ensure that the feature works seamlessly with the existing editor and user interface.
  • Test the feature thoroughly to ensure reliability and usability.
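One common way to implement this is a snapshot-based history; the sketch below assumes the editor state can be cheaply copied or serialized (a command-pattern history would achieve the same with less memory):

```ts
// Snapshot history: commit() records a new state, undo()/redo() move along it.
class History<S> {
  private past: S[] = [];
  private future: S[] = [];

  constructor(private present: S) {}

  commit(next: S): void {
    this.past.push(this.present);
    this.present = next;
    this.future = []; // a new action invalidates anything previously redone
  }

  undo(): S {
    const previous = this.past.pop();
    if (previous !== undefined) {
      this.future.push(this.present);
      this.present = previous;
    }
    return this.present;
  }

  redo(): S {
    const next = this.future.pop();
    if (next !== undefined) {
      this.past.push(this.present);
      this.present = next;
    }
    return this.present;
  }
}
```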

Extractor Node to Find Tag by Content

Description

Enhance the Extractor Node to allow finding HTML tags based on their content (reverse search), providing more flexibility in data extraction.
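A minimal sketch of the reverse search using the DOM API; the function name is illustrative, and a server-side variant (e.g. with a parsed document) would look similar:

```ts
// Return all elements of the given tag whose text content matches a pattern.
function findTagsByContent(root: ParentNode, tag: string, pattern: RegExp): Element[] {
  return Array.from(root.querySelectorAll(tag)).filter((el) =>
    pattern.test(el.textContent ?? "")
  );
}
```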

Unrelated Edges Deleted When Node is Removed

Description

There is a bug in the on-canvas node placement tool. When multiple nodes are placed using the tool and then one of them is selected and deleted, all placed edges are also deleted. This occurs even for edges that are not connected to the deleted node. This bug disrupts the workflow and forces users to recreate unrelated connections, which can be time-consuming and frustrating.

Implement Dynamic connectionRules for Multiple NodeTypes

Description

We aim to enhance our node system to support nodes with dynamic input configurations. Users should be able to define the number and types of inputs a node can have, and the connectionRules should adjust accordingly. Specific nodes that can benefit from this feature include the "Zip" and "Database" nodes.
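A sketch of what dynamic rules could look like; the actual connectionRules shape in the project may differ, and the "Zip" example below is purely illustrative:

```ts
type HandleRule = { name: string; accepts: string[] };

interface ConnectionRules {
  inputs: HandleRule[];
  outputs: HandleRule[];
}

// A "Zip" node whose number of inputs comes from the user's node options,
// so the rules are generated rather than hard-coded.
function zipRules(inputCount: number): ConnectionRules {
  return {
    inputs: Array.from({ length: inputCount }, (_, i) => ({
      name: `input ${i + 1}`,
      accepts: ["string[]"],
    })),
    outputs: [{ name: "zipped", accepts: ["string[][]"] }],
  };
}
```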

Implement ChatGPT Node

Description

Develop a new node type that can take any command and use ChatGPT to generate a response. This node should utilize guardrails in the backend to ensure the quality of the output. The API token for ChatGPT should be configurable within the application, allowing users to easily set up and manage their OpenAI account details.

Tasks

  • Implement a system for configuring the ChatGPT API token within the application.
  • Design and implement a user interface for the new ChatGPT node, ensuring it is intuitive and accessible.
  • Develop backend functionality to send commands to ChatGPT and receive responses.
  • Implement guardrails to ensure the quality of the output from ChatGPT.
  • Ensure that the new node works seamlessly with the existing web crawler editor and execution engine.
  • Test the new node thoroughly to ensure reliability and usability.
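A minimal sketch of the backend call against OpenAI's public chat completions endpoint; the guardrails are reduced here to a system prompt and an empty-output check, and the model choice is an assumption:

```ts
async function runChatGptNode(apiToken: string, command: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`, // token comes from app configuration
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo", // assumed default; would be configurable
      messages: [
        { role: "system", content: "Answer concisely and output plain text only." },
        { role: "user", content: command },
      ],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI request failed: ${res.status}`);
  const data = await res.json();
  const output: string = data.choices[0].message.content ?? "";
  if (output.length === 0) throw new Error("Empty response from ChatGPT");
  return output;
}
```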

Delete Option in Node Options

Description

Implement a delete option within the node options to allow users to remove a node directly from its options panel.

Better handling of data flow

Description

Currently, all nodes output a list of values. That is not necessarily optimal for many nodes. For example, when the extractor node extracts multiple elements for each of the input elements, it might be useful to group the extracted elements together.

Currently it looks like:

Input: [1, 2]
Output: [1.1, 1.2, 2.1]

I need something like this, where the context of where each value came from is preserved:

Input: [1, 2]
Output: [[1.1, 1.2], [2.1]]

Both outputs have their own unique use cases, so there should be a system that supports both. It might be useful to make the second (grouped) approach the default and offer an option to flatten the output, either by creating a new node for it or by adding it as a default option in the options of a node.
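A sketch of that proposal: grouped output as the default, with an optional flatten that reproduces today's behavior (function and parameter names are illustrative):

```ts
// [1, 2] -> [[1.1, 1.2], [2.1]] by default, or [1.1, 1.2, 2.1] when flattened.
function extractAll<I, O>(
  inputs: I[],
  extract: (input: I) => O[],
  flatten = false
): O[] | O[][] {
  const grouped = inputs.map(extract); // one group per input element
  return flatten ? grouped.flat() : grouped;
}
```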

Visualize Data Type on Edges

Description

Enhance the visualization of edges in the web crawler editor to indicate the type of data being transported. This could be achieved through various means such as different colors, multiline edges for arrays, or other visual indicators. This will provide users with a clearer understanding of the data flow in their web crawler workflows.

Tasks

  • Design a system for visually indicating the type of data on an edge (e.g., different colors, multiline edges for arrays).
  • Implement updates to the edge rendering in the web crawler editor to include the new visual indicators.
  • Ensure that the visual indicators correctly reflect the type of data being transported on the edge.
  • Update the user interface to explain the meaning of the different visual indicators.
  • Ensure that the updated edge visualization works seamlessly with the existing web crawler editor and execution engine.
  • Test the updated edge visualization thoroughly to ensure reliability and usability.
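One possible shape for the indicator system: a lookup from output value type to edge style, which the edge renderer consults. The type names and colors below are placeholders:

```ts
const edgeStyleByType: Record<string, { stroke: string; multiline: boolean }> = {
  string: { stroke: "#4caf50", multiline: false },
  html: { stroke: "#2196f3", multiline: false },
  "string[]": { stroke: "#4caf50", multiline: true }, // arrays render as multiline edges
  "html[]": { stroke: "#2196f3", multiline: true },
};

function styleForEdge(outputValueType: string) {
  return edgeStyleByType[outputValueType] ?? { stroke: "#9e9e9e", multiline: false };
}
```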

Intuitive Drag-and-Drop List Ordering Feature

Description

An "Options Container" should be introduced to the interface, offering drag-and-drop functionality for users to effortlessly rearrange a list of components. The primary application of this feature targets the database table node, allowing users to modify the order of inputs based on their workflow and preferences, ensuring a more personalized and efficient user experience.

Improve System for Adding New Nodes

Description

Develop a new system for adding nodes to the web crawler editor. The goal is to have everything related to a node (the node itself, connection rules, metadata, etc.) defined in one place. One possible approach is to define everything in one file and reference the metadata of each node from the NodeType enum. This will simplify the process of adding new nodes and make the codebase easier to manage.

Automate the NodeMapTransformer to streamline the process of adding new nodes. Currently, each new node requires manual updates to the NodeMapTransformer.
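A sketch of the one-definition-per-node idea, keyed by the NodeType enum mentioned above; the NodeDefinition shape and the registry are assumptions about how this could look:

```ts
enum NodeType {
  Fetch = "fetch",
  HtmlExtractor = "html-extractor",
}

interface NodeDefinition {
  type: NodeType;
  label: string;
  connectionRules: { inputs: string[]; outputs: string[] };
  run: (inputs: unknown[]) => Promise<unknown[]>;
}

// A single registry: adding a node means adding one entry here, so the editor,
// the engine, and the NodeMapTransformer can derive metadata from one place.
const nodeRegistry: Record<NodeType, NodeDefinition> = {
  [NodeType.Fetch]: {
    type: NodeType.Fetch,
    label: "Fetch",
    connectionRules: { inputs: ["url"], outputs: ["html"] },
    run: async ([url]) => [await (await fetch(String(url))).text()],
  },
  [NodeType.HtmlExtractor]: {
    type: NodeType.HtmlExtractor,
    label: "HTML Extractor",
    connectionRules: { inputs: ["html"], outputs: ["elements"] },
    run: async ([html]) => [html], // extraction logic omitted in this sketch
  },
};
```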

Pop-up on Browser Window Reload

Description

Implement a pop-up warning or confirmation dialog when the user attempts to reload the browser window, to prevent accidental loss of work.
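The standard browser mechanism for this is a beforeunload handler; note that modern browsers show their own generic confirmation dialog and ignore custom text. The dirty-state check below is hypothetical:

```ts
function hasUnsavedChanges(): boolean {
  return true; // hypothetical: the editor would track its own dirty state
}

window.addEventListener("beforeunload", (event) => {
  if (hasUnsavedChanges()) {
    event.preventDefault();
    event.returnValue = ""; // some browsers require this to trigger the prompt
  }
});
```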

Implement View Mode for Highlighting Edges and Node Handles Based on Output Value Type

Description

Develop a new view mode for the web crawler editor that highlights edges and node handles based on the type of output value they are transporting. Edges should be colored according to the type of data they are carrying, and node handles should be highlighted in the colors of the types they accept. This will provide users with a clearer understanding of the data flow in their web crawler workflows.

Implement Dynamic Node Output Based on Input and Enforce Single OutputValueType per Handle

Description

Develop a system that can dynamically change the output of a node based on its input. This could potentially be implemented directly in the connection rules by mapping an input value to an output value. Additionally, enforce that each node handle can only accept one output value type. If a user tries to connect different types to the same handle, they should be warned that the new connection will remove the old ones. If the output value type changes and the outgoing connections no longer match the next inputs, these connections should be automatically deleted.

Tasks

  • Design a system for dynamically changing a node's output based on its input.
  • Update the connection rules to enforce that each handle can only accept one output value type.
  • Implement a warning system that informs the user when a new connection will remove old ones due to a mismatch in output value type.
  • Implement a mechanism that automatically deletes outgoing connections when the output value type changes and the connections no longer match the next inputs.
  • Test the new systems thoroughly to ensure reliability and usability.
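A sketch of both mechanisms with illustrative type names; the mapping table and the connection check are assumptions about how this could be wired into the connection rules:

```ts
type ValueType = "string" | "html" | "string[]" | "html[]";

// Dynamic output: derive a node's output type from the type arriving at its input.
const outputForInput: Partial<Record<ValueType, ValueType>> = {
  html: "string[]", // e.g. an extractor turns one HTML page into many strings
  "html[]": "string[]",
};

// Single type per handle: mixing types requires dropping the existing edges.
function canConnect(handleCurrentType: ValueType | null, incoming: ValueType) {
  if (handleCurrentType !== null && handleCurrentType !== incoming) {
    return {
      ok: false,
      warning: "Connecting will remove existing edges of a different type.",
    };
  }
  return { ok: true };
}
```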

Ability to pass multiple URLs to a fetch node

Description

It would be helpful to be able to pass multiple URLs to a fetch node. This would allow us to fetch and process multiple data sources at the same time. Currently, we can only pass one URL to a fetch node, which means that if we want to fetch multiple data sources, we need to create multiple fetch nodes. This can lead to a cluttered and difficult-to-read node map.
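A minimal sketch of the node's fetch step; using Promise.allSettled keeps one failing source from aborting the whole batch:

```ts
async function fetchAll(urls: string[]): Promise<string[]> {
  const results = await Promise.allSettled(
    urls.map((url) => fetch(url).then((res) => res.text()))
  );
  // Failed fetches become empty strings here; real error handling would log them.
  return results.map((r) => (r.status === "fulfilled" ? r.value : ""));
}
```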

Implement Multi-Input Merge Node

Description

Develop a new node type that can combine multiple inputs into a single output. This node should be able to take data from multiple preceding nodes and merge it into a single data set. This will allow users to combine data from different sources or stages of the web crawler workflow.

Tasks

  • Implement a system for configuring the merging behavior, allowing the user to specify how the inputs should be combined.
  • Design and implement a user interface for the new node, ensuring it is intuitive and accessible.
  • Develop backend functionality to merge multiple inputs into a single output.
  • Ensure that the new node works seamlessly with the existing web crawler editor and execution engine.
  • Test the new node thoroughly to ensure reliability and usability.
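A sketch of configurable merging with two illustrative strategies; the strategy names and data shapes are assumptions:

```ts
type MergeStrategy = "concat" | "zip";

function merge<T>(inputs: T[][], strategy: MergeStrategy): T[] | T[][] {
  if (inputs.length === 0) return [];
  if (strategy === "concat") return inputs.flat(); // one flat list, in input order
  // "zip": group the i-th element of every input together
  const length = Math.max(...inputs.map((list) => list.length));
  return Array.from({ length }, (_, i) =>
    inputs.filter((list) => i < list.length).map((list) => list[i])
  );
}
```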

Show names of inputs and outputs in the editor

Description

The handles should be labeled with the names given in the connectionRules. The labels provide a clear indication of what each handle is connected to, thereby reducing potential confusion and enhancing usability. This labeling should be consistently applied across all relevant parts of the application or system, ensuring a uniform user experience.

Implement System to Rerun or Restart Process at Current Node with Cached Value

Description

Develop a system that enables the rerunning of the current node or restarting the process at the current node if it has a cached value. This feature will provide users with more flexibility and control over the execution of the web crawler, allowing for more efficient debugging and optimization of the workflow.
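A minimal sketch of the cache check, assuming node outputs are cached by node id (the cache shape and the runFrom name are illustrative):

```ts
const outputCache = new Map<string, unknown[]>();

// Restart at a node: reuse its cached output if present; otherwise compute it
// (which in the real engine would first run its predecessors).
async function runFrom(
  nodeId: string,
  compute: (id: string) => Promise<unknown[]>
): Promise<unknown[]> {
  const cached = outputCache.get(nodeId);
  if (cached !== undefined) return cached;
  const output = await compute(nodeId);
  outputCache.set(nodeId, output);
  return output;
}
```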

Options container cutting off inputs

Description

The options container cuts off content if the options contain a lot of inputs.

This screenshot shows the options menu content cut off even with the scroll position at the very top:

[Screenshot of the application, showing the options cut off at the top.]

Error visualization of node in the canvas

Description

When a node encounters an error, it should be highlighted, possibly in red. Displaying the error message in the log could be beneficial. To ensure maximum compatibility, consider introducing a wrapper function around the run() method in the BasicNode. This run() method would be executed within a try-catch block. Subsequently, the Engine can invoke this new wrapper method instead of the direct run() method. If an error is caught, the node will be highlighted, and the error message will be logged.
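A sketch of that wrapper; BasicNode and run() come from the issue above, while the highlight and log hooks are hypothetical stand-ins for the real editor's UI:

```ts
abstract class BasicNode {
  constructor(readonly id: string) {}
  abstract run(inputs: unknown[]): Promise<unknown[]>;

  // The Engine calls runSafe() instead of run() directly.
  async runSafe(inputs: unknown[]): Promise<unknown[]> {
    try {
      return await this.run(inputs);
    } catch (error) {
      highlightNode(this.id, "red");                // mark the failing node on the canvas
      logError(`Node ${this.id} failed: ${error}`); // surface the message in the log
      throw error;
    }
  }
}

// Hypothetical UI hooks; no-op placeholders in this sketch.
function highlightNode(id: string, color: string): void {}
function logError(message: string): void {
  console.error(message);
}
```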

Prevent Node Deletion During Crawler Execution

Description

Implement a safeguard to prevent users from deleting nodes while the crawler is actively running. This will ensure the integrity of the crawling process and prevent potential errors or inconsistencies.

Implement Autocompletion for HTML Extractor Node Tag Option

Description

Develop an autocompletion feature for the HTML Extractor Node's tag option. This feature should provide users with suggestions for HTML tags based on the cached data from the previous node, if it exists. This will enhance the user experience by making it easier and faster to select the appropriate HTML tags when configuring the HTML Extractor Node.

Tasks

  • Develop backend functionality to analyze the cached data and generate a list of potential HTML tags.
  • Implement a system for displaying the autocompletion suggestions and allowing the user to select a suggestion.
  • Ensure that the feature works seamlessly with the existing HTML Extractor Node and user interface.
  • Test the feature thoroughly to ensure reliability and usability.
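A sketch of generating the suggestions: parse the cached HTML and collect the distinct tag names that actually occur in it (the function name is illustrative):

```ts
function suggestTags(cachedHtml: string): string[] {
  const doc = new DOMParser().parseFromString(cachedHtml, "text/html");
  const tags = new Set<string>();
  doc.querySelectorAll("*").forEach((el) => tags.add(el.tagName.toLowerCase()));
  return [...tags].sort(); // sorted list feeds the autocompletion dropdown
}
```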

Enhance Extractor Node to Support Multiple Extraction Modes

Description

Enhance the functionality of the Extractor Node to support multiple extraction modes. The node should be able to find either the first occurrence, all occurrences, or the nth occurrence of a specified pattern. Depending on the extraction mode, the node should output the results as a plain string/HTML or as an array of strings/HTML.

Tasks

  • Implement a system for configuring the extraction mode within the Extractor Node.
  • Design and implement updates to the user interface of the Extractor Node to support the new extraction modes.
  • Develop backend functionality to perform the different types of extraction based on the configured mode.
  • Ensure that the output of the Extractor Node is correctly formatted based on the extraction mode (plain string/HTML or array of strings/HTML).
  • Ensure that the enhanced Extractor Node works seamlessly with the existing web crawler editor and execution engine.
  • Test the enhanced Extractor Node thoroughly to ensure reliability and usability.
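A sketch of the three modes, returning a single string for "first"/"nth" and an array for "all", matching the output formats described above (names are illustrative):

```ts
type ExtractionMode = { kind: "first" } | { kind: "all" } | { kind: "nth"; n: number };

function extract(root: ParentNode, selector: string, mode: ExtractionMode): string | string[] {
  const matches = Array.from(root.querySelectorAll(selector)).map((el) => el.outerHTML);
  switch (mode.kind) {
    case "first":
      return matches[0] ?? "";
    case "nth":
      return matches[mode.n] ?? ""; // zero-based occurrence index
    case "all":
      return matches;
  }
}
```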

Extracting CSS selectors per click on website

Description

There should be a system for easily extracting CSS selectors from an embedded website through user interactions, specifically clicks. The goal is to allow a user to click on various elements within an iframe or embedded web view, and then determine the unique CSS selectors for those elements.
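A common approach, sketched below: from the clicked element, walk up the tree, preferring ids and falling back to :nth-of-type until the selector is unambiguous. This would run inside the embedded page's click handler:

```ts
function cssSelectorFor(el: Element): string {
  const parts: string[] = [];
  let node: Element | null = el;
  while (node) {
    const current = node;
    if (current.id) {
      parts.unshift(`#${current.id}`); // ids are assumed unique, stop walking up
      break;
    }
    const tag = current.tagName.toLowerCase();
    const siblings = current.parentElement
      ? Array.from(current.parentElement.children).filter((s) => s.tagName === current.tagName)
      : [current];
    const index = siblings.indexOf(current) + 1;
    parts.unshift(siblings.length > 1 ? `${tag}:nth-of-type(${index})` : tag);
    node = current.parentElement;
  }
  return parts.join(" > ");
}
```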

Incorporate Array Support for Node Input and Output Values

Description

Enhance the functionality of nodes to support array inputs and outputs. Currently, nodes only support single values for input and output. A system needs to be developed that can handle arrays and correctly process these in the web crawler workflow. Everything should be based on arrays, even though most of them will only have one value in them.

Implement Infinite Loop Detection for Web Crawler

Description

Implement a system for detecting infinite loops in the web crawler. Currently, the program crashes when executing a crawler that contains an infinite loop. There are two potential solutions: log the failure and let the user figure out the loop for themselves, or detect and prevent the creation of infinite loops during the crawler construction phase.
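A sketch of the prevention variant: a reachability check run before an edge is added, so a connection that would close a loop can be rejected up front (the edge representation is an assumption):

```ts
// Would adding the edge from -> to create a cycle? Check whether `from` is
// already reachable from `to` via the existing edges.
function wouldCreateCycle(edges: Array<[string, string]>, from: string, to: string): boolean {
  const adjacency = new Map<string, string[]>();
  for (const [a, b] of edges) {
    const list = adjacency.get(a) ?? [];
    list.push(b);
    adjacency.set(a, list);
  }
  const stack = [to];
  const seen = new Set<string>();
  while (stack.length > 0) {
    const current = stack.pop()!;
    if (current === from) return true; // path back to `from`: the new edge closes a loop
    if (seen.has(current)) continue;
    seen.add(current);
    stack.push(...(adjacency.get(current) ?? []));
  }
  return false;
}
```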

