scrapegraph-ai's Introduction

Hi there, I'm Marco Vinciguerra - aka Vinci 👋

I am a 24-year-old engineering student whose dream is to work at a big tech company like Google. Every day I work hard to make it happen 💪🏻

I am currently working on Scrapegraph-ai.

🔭 I’m currently studying for a master's degree in data science and data engineering

🌱 I’m currently learning Flutter, Math and Machine Learning

🥅 2024 Goals: Become a better developer and a better person

⚡ Fun fact: I love reading books and my favourite IDE is Neovim.

📫 How to reach me: LinkedIn


Tools and coding languages that I use

Java, React, Flutter, Dart, Firebase, Figma, PostgreSQL, MySQL, Postman

scrapegraph-ai's People

Contributors

arsaboo, cemkod, daniele-roncaglioni, darvat, dependabot[bot], dito97, dpende, duke147, elijahbenizzy, eltociear, epage480, f-aguzzi, ftoppi, iamgodot, jgalego, kahwoo, lurenss, mayurdb, perinim, schneehertz, semantic-release-bot, seyf97, shkamboj1, shubihu, skrawcz, spulci, supercoder-dev, tejhande, vincigit00, yuan-manx

scrapegraph-ai's Issues

Implement Graph Visualization with Graphviz

Description:

To enhance our understanding and debugging capabilities of the scraping workflows, we aim to introduce a feature for visualizing the graph structures using Graphviz. This visualization will help in quickly identifying the flow, relationships between nodes, and potential bottlenecks in the graph.

Tasks:

  • Research Graphviz syntax and integration options with Python;
  • Develop a utility function or class method that converts our current graph structure into a Graphviz-compatible format;
  • Ensure the visualization includes node names, types, and directional edges.
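
One possible starting point for the second and third tasks is sketched below. It assumes the graph can be described as a mapping of node names to node types plus a list of directed edges; the function name `to_dot` and that input shape are illustrative assumptions, not the library's API. The function only produces DOT source text, which can then be rendered with the Graphviz tools.

```python
def to_dot(nodes, edges, graph_name="scraping_graph"):
    """Render nodes and directed edges as Graphviz DOT source text.

    nodes: dict mapping node name -> node type (assumed input shape)
    edges: list of (source_name, target_name) tuples
    """
    lines = [f"digraph {graph_name} {{"]
    for name, node_type in nodes.items():
        # The label shows both the node name and its type on two lines.
        lines.append(f'    "{name}" [label="{name}\\n({node_type})"];')
    for source, target in edges:
        # "->" makes the edge directional in a digraph.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Illustrative three-node pipeline; the wiring is made up for the example.
nodes = {"fetch": "FetchNode", "parse": "ParseNode", "rag": "RAGNode"}
edges = [("fetch", "parse"), ("parse", "rag")]
print(to_dot(nodes, edges))
```

Emitting plain DOT text keeps the sketch free of extra dependencies; a Python Graphviz binding could be layered on top later.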

Adding debugging mode

Is your feature request related to a problem? Please describe.
I am having a challenge finding out where my project is going wrong. When I run a prompt I receive {} as the output and don't get any further information as to why I got that as the result.

Describe the solution you'd like
Some sort of debugging mode with higher level of detail around each stage of the pipeline and what succeeded and failed to get down to the root of the issue.

How to use an Azure portal PAT to scrape the data

I use the Azure DevOps portal for managing bug data, which I currently access using a Personal Access Token (PAT) and retrieve data through a Python script via APIs. Is it possible to integrate Scrapegraph-AI for this use case?

Playwright Implementation

Is your feature request related to a problem? Please describe.
JavaScript-based websites often fail to be fetched.

Describe the solution you'd like
Add an option to fetch the HTML using Playwright, implemented inside the fetch_node class.

Additional context
Try to make it headless
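
A minimal sketch of what such a fetch could look like with Playwright's synchronous API, assuming Playwright and a Chromium build are installed. The function name `fetch_rendered_html` and the timeout default are assumptions made for the example; how this would be wired into the fetch node is left open.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Return the page HTML after JavaScript has run, using headless Chromium."""
    # Imported lazily so code that never calls this does not need Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has settled.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            # Always release the browser, even if navigation fails.
            browser.close()
```

The returned string could then be passed on in the same way as HTML fetched by a plain HTTP request.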

Malicious data inside DOM

What happens if some malicious data is hidden (not visible to the regular human user) in the DOM?

For example:

Article 1

Article 2

Bomb!

//this is the hidden one

What is the output with a prompt like "Extract all the titles"? Is there a sanitization of the DOM before AI processing?
Moreover, can content influence how the AI "sees" the data in the DOM?
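
To make the question concrete, here is a rough sketch of one possible pre-processing step that drops text inside elements hidden via the `hidden` attribute or an inline `display:none` / `visibility:hidden` style. The original markup of the example did not survive, so the HTML below is an assumed reconstruction. The class name is hypothetical, and the sketch does not cover content hidden through CSS classes or external stylesheets.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """True if the tag carries `hidden` or an inline style that hides it."""
    attrs = dict(attrs)
    if "hidden" in attrs:
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or _is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    return parser.chunks


# Assumed markup for the example above; the third title is hidden.
html = ('<h1>Article 1</h1><h1>Article 2</h1>'
        '<h1 style="display: none">Bomb!</h1>')
print(visible_text(html))
```

Filtering like this only reduces the attack surface; visible text can still contain instructions aimed at the model.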

running in colab notebook fails with err

result = smart_scraper_graph.run()
the line above results in the error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-2af99396d10b> in <cell line: 1>()
----> 1 result = smart_scraper_graph.run()

5 frames
/usr/lib/python3.10/asyncio/runners.py in run(main, debug)
     31     """
     32     if events._get_running_loop() is not None:
---> 33         raise RuntimeError(
     34             "asyncio.run() cannot be called from a running event loop")
     35 

RuntimeError: asyncio.run() cannot be called from a running event loop
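
Notebook environments such as Colab already run an event loop, which is why `asyncio.run()` refuses to start a second one. One generic workaround, sketched here with only the standard library, is to run the coroutine on a separate thread that owns its own loop. The helper name `run_coroutine_blocking` is made up for the example, and this is not necessarily how the library addresses the problem.

```python
import asyncio
import threading


def run_coroutine_blocking(coro):
    """Run a coroutine to completion, even if a loop is already running here."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run() is safe to use directly.
        return asyncio.run(coro)

    # A loop is already running (e.g. in a notebook), so use a fresh thread
    # with its own event loop and wait for it to finish.
    outcome = {}

    def _worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=_worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note that the calling thread blocks until the coroutine finishes, which is usually acceptable for a one-off `run()` call in a notebook cell.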

Allow using a remote Ollama

People don't always run apps on the same computer where they run Ollama.
Since you provide a docker-compose.yml to run ollama, you might run your app in another container.

Allow people to set base_url to configure ChatOllama to connect to a URL other than http://localhost:11434.

AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Describe the bug
I just followed the instructions and ran the example, and I am getting the following error.

I have verified ollama is running and I can use the models using ollama-webui

I am using the following models, which are downloaded and working in the web UI:

ollama/gemma:2b
ollama/nomic-embed-text

Traceback (most recent call last):
  File "/home/leo/dev/projs/scrapegraph/main.py", line 16, in <module>
    smart_scraper_graph = SmartScraperGraph(
                          ^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 24, in __init__
    super().__init__(prompt, config, source)
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py", line 25, in __init__
    self.graph = self._create_graph()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 39, in _create_graph
    node_config={"chunk_size": self.model_token}
                               ^^^^^^^^^^^^^^^^
AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Process finished with exit code 1

To Reproduce
Steps to reproduce the behavior:

  1. Run Ollama using docker
  2. Copy and paste the example
  3. Install scrapegraphai
  4. See error

Expected behavior
It should work seamlessly

Desktop (please complete the following information):

  • OS: Ubuntu 23.10

Library components interaction

Is your feature request related to a problem? Please describe.
It is difficult to understand how the library works under the hood and how it is structured.

Describe the solution you'd like
A diagram showing the main modules contained in the library and how they interact with each other, including the different types of graphs, nodes, etc.

The job post does not contain any body content.

Describe the bug
I'm trying to scrape the content of a page, asking to summarize the job post. What I get as answer is the following:
"The job post does not contain any body content."

To Reproduce
Scrape for example the following page

Expected behavior
Getting a summary of the page content.

Fix releases in github

To make the current release clear to users, I suggest implementing a GitHub Action for releases.

Basically move from that:

Screenshot 2024-03-12 at 11 23 43

To that:
image

Scrape JSON files

Is your feature request related to a problem? Please describe.
I would like to get information from a JSON file or string.

Describe the solution you'd like
I want to be able to use scraping graphs like SmartScraper with JSON input, perhaps by creating a new graph or node.

Fix XML example

Describe the bug
The XML example parses the input as if it were HTML content. The problem is related to the remover() function inside the fetch_node class.

To Reproduce
Steps to reproduce the behavior:

  1. Run examples/openai/scrape_xml_openai.py
  2. Warning:

--- Executing Fetch Node ---
C:\Python\Python311\Lib\html\parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
k = self.parse_starttag(i)
--- Executing Parse Node ---

  3. Output:

{'books': [{'title': "XML Developer's Guide", 'author': 'Unknown', 'genre': 'Unknown'}]}

Expected behavior
Do not raise exceptions
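
For illustration, a small sketch of handling XML input with a real XML parser from the standard library instead of an HTML parser. The helper names and the sample document are made up for the example; the actual fix in the library may look different, for instance by passing features="xml" to BeautifulSoup as the warning suggests.

```python
import xml.etree.ElementTree as ET


def looks_like_xml(text: str) -> bool:
    """Cheap heuristic: the document starts with an XML declaration."""
    return text.lstrip().startswith("<?xml")


def xml_to_records(text: str, record_tag: str) -> list:
    """Parse XML and return one dict of child tag -> text per record element."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


# Made-up sample data, not the file used by the example script.
sample = """<?xml version="1.0"?>
<catalog>
  <book><author>Jane Doe</author><title>Example Title</title><genre>Computer</genre></book>
</catalog>"""

if looks_like_xml(sample):
    print(xml_to_records(sample, "book"))
```

With the element structure preserved, fields such as author and genre stay attached to their book instead of being lost during HTML-style cleanup.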

Screenshots
No screenshots

Desktop (please complete the following information):

  • OS: Windows
  • Library Version: 0.4.1

Additional context
No additional context

Hugging Face API

Would it be possible to add the Hugging Face API to access many different models?

Ollama embedding model is always `llama2`

Add getState() function to Abstract Graph

Return the complete final state dictionary of the graph.
So, for example it can be used as:

final_state = smart_scraper.getState()

It could also accept an additional argument to directly specify the key you want from the state:

urls = smart_scraper.getState(key="urls")
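
A minimal sketch of the proposed accessor, written against a hypothetical stand-in class rather than the real Abstract Graph. It uses the PEP 8 spelling `get_state` instead of `getState`, and the contents of the state dictionary are invented for the example.

```python
class ExampleGraph:
    """Hypothetical stand-in for a scraping graph; not the library's class."""

    def __init__(self):
        self.final_state = None  # filled in by run()

    def run(self):
        # A real graph would execute its nodes here; this state is made up.
        self.final_state = {
            "answer": "example answer",
            "urls": ["https://example.com"],
        }
        return self.final_state.get("answer", "No answer found.")

    def get_state(self, key=None):
        """Return the whole final state, or a single entry when key is given."""
        if self.final_state is None:
            raise RuntimeError("run() must be called before get_state()")
        if key is None:
            return self.final_state
        return self.final_state[key]


graph = ExampleGraph()
graph.run()
final_state = graph.get_state()
urls = graph.get_state(key="urls")
```

Raising an error before `run()` is one design choice; returning an empty dict would be another.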

Update contributing.md with new policy ci/cd

Add the new CI/CD workflow to contributing.md.

People who want to contribute have to create a branch from pre/beta; once development is done, they merge into pre/beta, using semantic-release keywords.

Add verbosity flag to remove print statements

Is your feature request related to a problem? Please describe.
The print statements make the outputs too verbose - it would be useful to optionally disable them using a flag

Describe the solution you'd like
Add an optional flag in the node classes to remove print statements
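
One way this could look, sketched with a hypothetical node class: a `verbose` entry in the node configuration decides whether progress messages are printed. The class name, the `log` helper and the default value are assumptions for the example.

```python
class ExampleNode:
    """Hypothetical node with an optional verbosity flag."""

    def __init__(self, node_name, node_config=None):
        self.node_name = node_name
        # Default to the current behaviour (printing) unless told otherwise.
        self.verbose = (node_config or {}).get("verbose", True)

    def log(self, message):
        """Print only when verbose output is enabled."""
        if self.verbose:
            print(message)

    def execute(self, state):
        self.log(f"--- Executing {self.node_name} Node ---")
        return state


quiet_node = ExampleNode("Fetch", {"verbose": False})
quiet_node.execute({})  # prints nothing
```

Routing every message through a single helper also makes it easy to switch to the `logging` module later.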

Ollama embeddings not using base_url

I am trying the example code given with Ollama.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
        "model_tokens": 4000,
    },
    "embeddings": {
        "model": "ollama/mxbai-embed-large",
        "temperature": 0,
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
    }
}

However, it appears that the base_url is completely ignored for the embeddings.

I was able to run it with the ollama/nomic-embed-text embedding model (which is on my local computer). If I choose any other embedding model, such as ollama/snowflake-arctic-embed:latest or ollama/mxbai-embed-large, which are available on my server (on 192.168.2.162), I get the following error (note that llama3 is not available on my local computer and is correctly being used from the base_url):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 15
      9 for source in sources:
     10     smart_scraper_graph = SmartScraperGraph(
     11         prompt="Get the Title, Place of Publication, Publisher, Dates of publication, Frequency, and Notes. Don't return anything else.",
     12         source=source,
     13         config=graph_config
     14     )
---> 15     result = smart_scraper_graph.run()
     16     results.append(result)

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py:74, in SmartScraperGraph.run(self)
     70 """
     71 Executes the web scraping process and returns the answer to the prompt.
     72 """
     73 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
---> 74 self.final_state, self.execution_info = self.graph.execute(inputs)
     76 return self.final_state.get("answer", "No answer found.")

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\base_graph.py:83, in BaseGraph.execute(self, initial_state)
     80 current_node = self.nodes[current_node_name]
     82 with get_openai_callback() as cb:
---> 83     result = current_node.execute(state)
     85     node_exec_time = time.time() - curr_time
     86     total_exec_time += node_exec_time

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\nodes\rag_node.py:101, in RAGNode.execute(self, state)
     98 else:
     99     raise ValueError("Embedding Model missing or not supported")
--> 101 retriever = FAISS.from_documents(
    102     chunked_docs, embeddings).as_retriever()
    104 redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
    105 # similarity_threshold could be set, now k=20

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_core\vectorstores.py:550, in VectorStore.from_documents(cls, documents, embedding, **kwargs)
    548 texts = [d.page_content for d in documents]
    549 metadatas = [d.metadata for d in documents]
--> 550 return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\vectorstores\faiss.py:930, in FAISS.from_texts(cls, texts, embedding, metadatas, ids, **kwargs)
    903 @classmethod
    904 def from_texts(
    905     cls,
   (...)
    910     **kwargs: Any,
    911 ) -> FAISS:
    912     """Construct FAISS wrapper from raw documents.
    913 
    914     This is a user friendly interface that:
   (...)
    928             faiss = FAISS.from_texts(texts, embeddings)
    929     """
--> 930     embeddings = embedding.embed_documents(texts)
    931     return cls.__from(
    932         texts,
    933         embeddings,
   (...)
    937         **kwargs,
    938     )

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:211, in OllamaEmbeddings.embed_documents(self, texts)
    202 """Embed documents using an Ollama deployed embedding model.
    203 
    204 Args:
   (...)
    208     List of embeddings, one for each text.
    209 """
    210 instruction_pairs = [f"{self.embed_instruction}{text}" for text in texts]
--> 211 embeddings = self._embed(instruction_pairs)
    212 return embeddings

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in OllamaEmbeddings._embed(self, input)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in <listcomp>(.0)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:173, in OllamaEmbeddings._process_emb_response(self, input)
    170     raise ValueError(f"Error raised by inference endpoint: {e}")
    172 if res.status_code != 200:
--> 173     raise ValueError(
    174         "Error raised by inference API HTTP code: %s, %s"
    175         % (res.status_code, res.text)
    176     )
    177 try:
    178     t = res.json()

ValueError: Error raised by inference API HTTP code: 404, {"error":"model 'mxbai-embed-large' not found, try pulling it first"}

To confirm my suspicion, I pulled mxbai-embed-large on my local computer and then I was able to run the same.
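
The behaviour I would expect is that the embeddings section of the config is translated into constructor arguments that keep the base_url. A rough sketch of that translation is below; the helper name is hypothetical, and passing the result to `OllamaEmbeddings(**kwargs)` is an assumption about how the library could wire it up.

```python
def ollama_embedding_kwargs(embedder_config: dict) -> dict:
    """Translate an "embeddings" config section into embedder keyword arguments."""
    model = embedder_config["model"]
    # Strip the provider prefix, e.g. "ollama/mxbai-embed-large".
    if model.startswith("ollama/"):
        model = model[len("ollama/"):]
    kwargs = {"model": model}
    # Forward the remote URL instead of silently falling back to localhost.
    if "base_url" in embedder_config:
        kwargs["base_url"] = embedder_config["base_url"]
    return kwargs


embeddings_config = {
    "model": "ollama/mxbai-embed-large",
    "temperature": 0,
    "base_url": "http://192.168.2.162:11434",
}
print(ollama_embedding_kwargs(embeddings_config))
```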

Convert Graphviz Diagrams to Scraping Class Structures

Description:

Building on our graph visualization capabilities, we aim to develop a feature that allows the creation of scraping class structures from Graphviz diagrams. This will enable rapid prototyping of scraping workflows from visual designs.

Tasks:

  • Define a standard format or conventions for Graphviz diagrams representing scraping workflows;
  • Implement a parser that reads Graphviz diagram files and extracts node and edge information;
  • Develop logic to generate Python class definitions for nodes and graphs based on the parsed diagram.
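
As a rough illustration of the parsing task, the sketch below reads a deliberately small subset of DOT: one statement per line, nodes declared with a custom `type` attribute, and simple `a -> b` edges. The `type` attribute convention and the function name are assumptions made for the example. A real implementation would likely rely on a proper DOT parser, and chained edges such as `a -> b -> c` are not handled here.

```python
import re

# Matches:  fetch -> parse;   (optionally quoted names)
EDGE_RE = re.compile(r'"?([\w-]+)"?\s*->\s*"?([\w-]+)"?')
# Matches:  fetch [type=FetchNode];
NODE_RE = re.compile(r'^\s*"?([\w-]+)"?\s*\[\s*type\s*=\s*"?(\w+)"?\s*\]')


def parse_dot(dot_source):
    """Return ({node_name: node_type}, [(source, target), ...])."""
    nodes, edges = {}, []
    for line in dot_source.splitlines():
        edge = EDGE_RE.search(line)
        if edge:
            edges.append(edge.groups())
            continue
        node = NODE_RE.match(line)
        if node:
            nodes[node.group(1)] = node.group(2)
    return nodes, edges


diagram = """
digraph scraping_graph {
    fetch [type=FetchNode];
    parse [type=ParseNode];
    fetch -> parse;
}
"""
nodes, edges = parse_dot(diagram)
print(nodes, edges)
```

The resulting node and edge lists are the information a later code-generation step would need.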

Please add examples to the documentation, struggling with understanding how it works

Wonderful! I was looking for something like this library!

But I found it difficult to use the library because there are no examples.

For instance, I need to extract the name, the province and the address of all cinema theaters listed on this website:

PROVINCES
https://www.mymovies.it/cinema/

The information is divided into 3 levels: the above page contains the list of provinces, and each province links to the names of the theatres:

PROVINCES->THEATERS
https://www.mymovies.it/cinema/roma/

And each theatre name links to the address:

PROVINCES->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/5157/

Sometimes there is a 4th level for smaller towns inside the province area of the city:

PROVINCES->SMALL_TOWNS
https://www.mymovies.it/cinema/roma/fianoromano/

PROVINCES->SMALL_TOWNS->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/fianoromano/5102/

How am I supposed to ask for that and get back the names and addresses of all Italian cinema theatres, divided by province?

Thanks in advance!
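
For reference, the level structure described above can be recognised from the URL path alone, which is one building block a multi-page crawl would need. This sketch assumes that a purely numeric last path segment identifies a theatre page; the function name and the level labels are made up for the example.

```python
from urllib.parse import urlparse


def classify_cinema_url(url):
    """Guess which level of the site a URL belongs to from its path."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts or parts[0] != "cinema":
        return "other"
    rest = parts[1:]
    if not rest:
        return "provinces"        # top-level list of provinces
    if rest[-1].isdigit():
        return "theatre"          # numeric id: page with the address
    return "province" if len(rest) == 1 else "small_town"


for url in [
    "https://www.mymovies.it/cinema/",
    "https://www.mymovies.it/cinema/roma/",
    "https://www.mymovies.it/cinema/roma/5157/",
    "https://www.mymovies.it/cinema/roma/fianoromano/",
    "https://www.mymovies.it/cinema/roma/fianoromano/5102/",
]:
    print(classify_cinema_url(url), url)
```

A crawl could then follow links level by level and run the extraction prompt only on pages classified as theatre pages.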

Create Scraping Graphs Based on User Prompt

Description:

To simplify the creation of LLM-powered scraping graphs, we propose the development of a Graph Builder class. This class will abstract the complexity of graph construction, allowing users to create powerful scraping workflows using only a natural language prompt and selecting from available nodes in the library.

Tasks:

  • Design the Graph Builder class interface, focusing on ease of use and flexibility;
  • Implement logic to interpret the prompt and select appropriate nodes and configurations;
  • Integrate with the existing LLM setup to ensure seamless generation of answers and interactions within the graph.

ModuleNotFoundError: No module named 'fp'

Describe the bug

I ran https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_openai.py and got the error: ModuleNotFoundError: No module named 'fp'

To Reproduce
I ran the following example and got the error:

"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

D:\Application\Conda\envs\agentkit\python.exe D:\Projects\githubhub\agentkit\tmp.py 
Traceback (most recent call last):
  File "D:\Projects\githubhub\agentkit\tmp.py", line 7, in <module>
    from scrapegraphai.graphs import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\__init__.py", line 5, in <module>
    from .smart_scraper_graph import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 5, in <module>
    from ..nodes import (
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\__init__.py", line 5, in <module>
    from .fetch_node import FetchNode
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\fetch_node.py", line 9, in <module>
    from ..utils.remover import remover
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\__init__.py", line 8, in <module>
    from .proxy_rotation import proxy_generator
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\proxy_rotation.py", line 4, in <module>
    from fp.fp import FreeProxy
ModuleNotFoundError: No module named 'fp'
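
The traceback shows that importing any graph pulls in the proxy rotation utility, which imports `fp` at module level. One common pattern for optional dependencies is to check for the module and import it only when the feature is used, as sketched below. The helper names are hypothetical, and the hint that the `free-proxy` package provides the `fp` module is an assumption.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def load_free_proxy_class():
    """Import FreeProxy lazily so a missing dependency fails with a clear hint."""
    if not module_available("fp"):
        raise ImportError(
            "Proxy rotation needs the 'fp' module; installing the "
            "'free-proxy' package is assumed to provide it."
        )
    from fp.fp import FreeProxy
    return FreeProxy


print(module_available("fp"))
```

With a guard like this, users who never use proxy rotation would not hit the import error at all.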

Method to pass common params to all the nodes in the graph

Is your feature request related to a problem? Please describe.
There is no way to pass many parameters all at once to each node in a graph. For example, it would be nice to have the verbose flag available in all the nodes already.

What I am doing now is defining the params inside the AbstractGraph (which is inherited in all the scraping graphs) with something like:

image

and then passing it to each node explicitly, for example inside SmartScraperGraph:

image

You may understand that this method doesn't scale up well.

Describe the solution you'd like
I would like a function like this inside AbstractGraph that updates the node_config dict of every node:

image

Describe alternatives you've considered
None

Additional context
None
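
Since the screenshots above are not reproduced here, the following is only a sketch of the idea with hypothetical stand-in classes: a single method walks over all nodes and merges shared parameters into each node's node_config. The method name `set_common_params` and the `overwrite` option are assumptions for the example.

```python
class ExampleNode:
    """Hypothetical node that keeps its settings in node_config."""

    def __init__(self, name, node_config=None):
        self.name = name
        self.node_config = dict(node_config or {})


class ExampleGraph:
    """Hypothetical graph holding a list of nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def set_common_params(self, params, overwrite=False):
        """Copy shared parameters into every node's node_config."""
        for node in self.nodes:
            for key, value in params.items():
                # Keep node-specific settings unless overwrite is requested.
                if overwrite or key not in node.node_config:
                    node.node_config[key] = value


graph = ExampleGraph([
    ExampleNode("fetch"),
    ExampleNode("parse", {"chunk_size": 4000, "verbose": True}),
])
graph.set_common_params({"verbose": False})
```

Here the fetch node receives `verbose=False`, while the parse node keeps its own explicit setting because `overwrite` defaults to False.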

The ability to use OpenAI embeddings with Groq

Hey there! It would be great if we could use OpenAI embeddings (or any other supported API-based embedding models) with Groq (or any other supported llm). With the current way the code is organized, you can only use OpenAI embeddings with OpenAI models. If I want to use Groq as my main llm, I would have to use Ollama, which is ok if you want to run models locally. But I don't want to install models on my local machine, I would prefer to use OpenAI as my embedder service.

One way to add this is to change the way self.embedder_model is initialized in the AbstractGraph class. Currently, both self.llm_model and self.embedder_model are initialized using one method, self._create_llm(), which kinda makes our options limited. One possible solution is to add another function e.g. self._create_embedder() and completely separate the logic for initialization of llms and embedder models.

nomic-embed-text KeyError: 'Model not supported'

Describe the bug
I am trying to run the example from the readme.md file using ollama but the following error is raised:

    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'

To Reproduce
Steps to reproduce the behavior:

  1. Install Ollama and Scrapegraph-ai
  2. Create a script with the following code:
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
  3. See error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:230, in AbstractGraph._create_embedder(self, embedder_config)
    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[1], line 16
      1 from scrapegraphai.graphs import SmartScraperGraph
      3 graph_config = {
      4     "llm": {
      5         "model": "ollama/llama3",
   (...)
     13     }
     14 }
---> 16 smart_scraper_graph = SmartScraperGraph(
     17     prompt="List me all the articles",
     18     # also accepts a string with the already downloaded HTML code
     19     source="https://perinim.github.io/projects",
     20     config=graph_config
     21 )
     23 result = smart_scraper_graph.run()
     24 print(result)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py:47, in SmartScraperGraph.__init__(self, prompt, source, config)
     46 def __init__(self, prompt: str, source: str, config: dict):
---> 47     super().__init__(prompt, config, source)
     49     self.input_key = "url" if source.startswith("http") else "local_dir"

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:51, in AbstractGraph.__init__(self, prompt, config, source)
     48 self.config = config
     49 self.llm_model = self._create_llm(config["llm"], chat=True)
     50 self.embedder_model = self._create_default_embedder(    
---> 51     ) if "embeddings" not in config else self._create_embedder(
     52     config["embeddings"])
     54 # Set common configuration parameters
     55 self.verbose = True if config is None else config.get("verbose", False)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:232, in AbstractGraph._create_embedder(self, embedder_config)
    230         models_tokens["ollama"][embedder_config["model"]]
    231     except KeyError:
--> 232         raise KeyError("Model not supported")
    233     return OllamaEmbeddings(**embedder_config)
    235 elif "hugging_face" in embedder_config["model"]:

KeyError: 'Model not supported'

Expected behavior
The README example should run without raising an error.

Desktop (please complete the following information):

  • OS: MacOS 14.31.
  • Browser: Chrome

Additional context
Ollama works fine with llama3 and nomic-embed-text outside this project.
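The traceback suggests the failure is a plain dictionary lookup: the embedder name, with the ollama/ prefix already removed, is missing from models_tokens["ollama"]. Below is a minimal sketch of that kind of check and one possible way to relax it. The table contents, DEFAULT_TOKENS and lookup_tokens are assumptions made up for illustration, not the library's actual code or values.

```python
# Hypothetical excerpt; the real models_tokens table in the library is larger.
models_tokens = {
    "ollama": {
        "llama3": 8192,
    },
}

DEFAULT_TOKENS = 8192  # assumed fallback, not a value taken from the library


def lookup_tokens(provider: str, model_id: str, strict: bool = True) -> int:
    """Return the token limit for a model, optionally falling back to a default."""
    # "ollama/nomic-embed-text" -> "nomic-embed-text"
    name = model_id.split("/", 1)[-1]
    try:
        return models_tokens[provider][name]
    except KeyError:
        if strict:
            # "from None" hides the inner KeyError from the traceback.
            raise KeyError(f"Model not supported: {provider}/{name}") from None
        return DEFAULT_TOKENS


print(lookup_tokens("ollama", "ollama/llama3"))
print(lookup_tokens("ollama", "ollama/nomic-embed-text", strict=False))
```

Naming the missing model in the error message, or adding nomic-embed-text to the table, would both make this report easier to diagnose.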

Create for, while and switch nodes

Create a new folder outside nodes called conditional nodes and add new nodes that choose a specific output based on a given input.

  • for node: given an input, iterates over those nodes and returns an output
  • switch/if node: given a state and a condition, chooses a branch and returns the name (string) of the next node
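A rough sketch of how the two nodes described above could look. The class names, the execute method and the state layout are assumptions for illustration; they are not taken from the existing node classes.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


class SwitchNode:
    """Evaluates a condition on the state and returns the next node's name."""

    def __init__(self, condition: Callable[[State], bool], if_true: str, if_false: str):
        self.condition = condition
        self.if_true = if_true
        self.if_false = if_false

    def execute(self, state: State) -> str:
        return self.if_true if self.condition(state) else self.if_false


class ForNode:
    """Runs a list of steps on every item of an input list in the state."""

    def __init__(self, input_key: str, output_key: str, body: List[Callable[[object], object]]):
        self.input_key = input_key
        self.output_key = output_key
        self.body = body

    def execute(self, state: State) -> State:
        results = []
        for item in state[self.input_key]:
            for step in self.body:  # each step receives the previous step's output
                item = step(item)
            results.append(item)
        # Return a new state instead of mutating the one passed in.
        return {**state, self.output_key: results}


# Node names below are placeholders.
loop = ForNode("links", "cleaned", [str.strip, str.lower])
switch = SwitchNode(lambda s: len(s["cleaned"]) > 0, "fetch_links", "generate_answer")

state = loop.execute({"links": [" HTTPS://A.example ", "https://b.example"]})
next_node = switch.execute(state)
print(state["cleaned"], next_node)
```

Returning the next node's name as a string keeps the switch node independent of the graph object, which can resolve the name itself.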

Groq model implementation

Is your feature request related to a problem? Please describe.
Groq is not supported

Describe the solution you'd like
Implement this model in the list of available models for Scrapegraph.

Describe alternatives you've considered
none

Additional context
Issue from Discord

insufficient_quota

Describe the bug
I keep getting insufficient_quota when testing with Google Colab.

To Reproduce
Run the Colab example.

Expected behavior
It is probably rate limiting, because the library is calling OpenAI too fast.

Screenshots
Run the Colab example.

Additional context
I think that if smart_scraper_graph.run() accepted a parameter to adjust the rate at which it calls the OpenAI API, it should work fine.

Another approach would be to support hosted open LLMs, such as together.ai's llama-3-70b, which may not have this issue.
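Until such a parameter exists, the call could be wrapped from the outside. The sketch below retries with exponential backoff and assumes the failure really is transient rate limiting; insufficient_quota can also mean the account has run out of credit, in which case retrying will not help. run_with_backoff is a hypothetical helper, not part of the library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying with exponential backoff when it raises."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Sketch only: real code should catch just the client's
            # rate-limit exception instead of every Exception.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

For the Colab example this would be used as run_with_backoff(smart_scraper_graph.run). Note that it retries the whole run, not the individual API calls inside it.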

Azure Example

Hi there, love the idea of the package! I'm looking to use my Azure instance to run this and I can't seem to find any examples of how to do it in the docs and haven't been able to get something working.

This is the starting point of my Azure-based LLM usage. How can I go from this to using this package with one of the models I have hosted there?

Thanks in advance.


from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=my_endpoint,
    api_key=my_api_key,
    api_version=my_api_version
)

message_text = [{"role":"system","content":"You are an AI assistant that helps people find information."}]

completion = client.chat.completions.create(
    model="model_name",
    messages=message_text,
    ...
)

response_str = completion.choices[0].message.content.strip()

Scraping n levels deep

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n-levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links found on example.com and scrape those pages too, up to n levels.

Describe alternatives you've considered
I can use BeautifulSoup to download the pages and then feed them to this library.
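As an illustration of the alternative, here is a small breadth-first crawler built only on the Python standard library. crawl, fetch and LinkCollector are hypothetical names, and the sketch has no error handling, politeness delay or same-domain filter, all of which a real crawler would want.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url: str) -> str:
    """Download a page as text (sketch only, no error handling)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    start_url: str,
    max_depth: int,
    fetch_page: Callable[[str], str] = fetch,
) -> Dict[str, str]:
    """Breadth-first crawl; returns {url: html} for pages up to max_depth links away."""
    pages: Dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch_page(url)
        pages[url] = html
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages
```

Each HTML string in the result could then be passed as source, since the README example notes that source also accepts already downloaded HTML.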
