vincigit00 / scrapegraph-ai

12.9K 87.0 977.0 8.36 MB

Python scraper based on AI

Home Page: https://scrapegraphai.com

License: MIT License

Python 99.34% Shell 0.62% Dockerfile 0.05%

machine-learning scraping scraping-python scrapingweb automated-scraper sc gpt-3 gpt-4 llm llama3

scrapegraph-ai's Introduction

Hi there, I'm Marco Vinciguerra - aka Vinci 👋

I am a 24 year old boy and engineering student whose dream is to work in a big tech like Google. Every day I work hard to make it happen 💪🏻

I am currently working on Scrapegraph-ai.

🔭 I’m currently studying for a master's degree in data science and data engineering

🌱 I’m currently learning Flutter, Math and Machine Learning

🥅 2024 Goals: Become a better developer and a better person

⚡ Fun fact: I love reading books and my favourite IDE is Neovim.

📫 How to reach me:

Tools and coding languages that I use

scrapegraph-ai's People

perinim marcovinciguerra subzero-team lihuibng seshubonam merit-metrics tanio253 ftoppi fattimei vedovati-matteo arsaboo dpende subzero-team gururaja-ai hoooni polya20 iwillcodeu abhijeetkb06 arrietafernando cognimach moxmoussa bananemure mivanovitch sbogutyn giocarbo sorokinvld donwany anupgoenka hotelzululima nopeanuts briancg42 uy-nguyen00 medinlavie llathieyre desinghdel jesusoctavioas rajendharmendra w3ss bryceindata amin-aiolos-cloud imprateeksh mansh7763 michaelwnau christogh daoyuan14 hungle90 guezpatzchio assignments-introduction-to-ai legaltextai chispasgg rhinojosa aria1991 xfarooqi lance911 saugatach shaunchew zgrmjfj-903 harisxue beimingmaster cryptoxunm toozig dyl777 dr-data dmillner galaxycenter veryvanya seemirra redrusty2 avineshwar keating666 daanferdinandusse alexanderzhaoofchina suepradun symbiose-as sumeshks1987 aaronjessen jfeng3 koufuchi zhaopufeng b08240 gitb33rman kenny1216 kendralabs epage480 drroad donovoi f901107 shorthills-ai nixent mostrub s4mpl3r vitamin3615 jfontestad farvision2 lazybouy amit1nayak wodole zeroxclem jonathanbrivio prompted365

scrapegraph-ai's Issues

Implement Graph Visualization with Graphviz

Description:

To enhance our understanding and debugging capabilities of the scraping workflows, we aim to introduce a feature for visualizing the graph structures using Graphviz. This visualization will help in quickly identifying the flow, relationships between nodes, and potential bottlenecks in the graph.

Tasks:

Research Graphviz syntax and integration options with Python;
Develop a utility function or class method that converts our current graph structure into a Graphviz-compatible format;
Ensure the visualization includes node names, types, and directional edges.
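
As a rough illustration of the second and third tasks, the sketch below turns a plain description of nodes and edges into Graphviz DOT text. It does not use the library's real graph classes: the `graph_to_dot` helper, the dictionary layout, and the node names are assumptions made only for this example.

```python
def graph_to_dot(nodes, edges, name="scraping_graph"):
    """Render a node/edge description as Graphviz DOT text.

    nodes: dict mapping node name -> node type (plain strings, hypothetical layout)
    edges: iterable of (source_name, target_name) pairs
    """
    lines = [f"digraph {name} {{"]
    for node_name, node_type in nodes.items():
        # Show both the node name and its type in the label.
        lines.append(f'    "{node_name}" [label="{node_name}\\n({node_type})"];')
    for source, target in edges:
        # "->" is the DOT syntax for a directed edge.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = graph_to_dot(
    nodes={"fetch": "FetchNode", "parse": "ParseNode", "rag": "RAGNode"},
    edges=[("fetch", "parse"), ("parse", "rag")],
)
print(dot_text)
```

The resulting text can be saved to a `.dot` file and rendered with the Graphviz command-line tools.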

Adding debugging mode

Is your feature request related to a problem? Please describe.
I am having a challenge finding out where my project is going wrong. When I run a prompt I receive {} as the output and don't get any further information as to why I got that as the result.

Describe the solution you'd like
Some sort of debugging mode with higher level of detail around each stage of the pipeline and what succeeded and failed to get down to the root of the issue.

How to use Azure portal PAT to scrap the data

I use the Azure DevOps portal for managing bug data, which I currently access using a Personal Access Token (PAT) and retrieve data through a Python script via APIs. Is it possible to integrate Scrapegraph-AI for this use case?

Playwright Implementation

Is your feature request related to a problem? Please describe.
JavaScript-based websites often fail to be fetched.

Describe the solution you'd like
Add an option to fetch the HTML using Playwright, implemented inside the fetch_node class.

Additional context
Try to make it headless

Malicious data inside DOM

What happens if some malicious data is hidden (not visible to the regular human user) in the DOM?

For example:

Article 1

Article 2

Bomb!

//this is the hidden one

What is the output with a prompt like "Extract all the titles"? Is there a sanitization of the DOM before AI processing?
Moreover, can hidden content influence how the AI "sees" the data of the DOM?
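
One possible form of the sanitization asked about here is to drop text from elements that are explicitly marked as hidden before the page reaches the model. The sketch below uses only the standard library. The sample markup is an assumption (the original HTML of the example did not survive), and the check covers only the `hidden` attribute and inline `display:none` / `visibility:hidden` styles, not CSS classes or scripts.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """Return True if the tag carries an obvious inline 'hidden' marker."""
    attr_map = {name: (value or "") for name, value in attrs}
    style = attr_map.get("style", "").replace(" ", "").lower()
    return (
        "hidden" in attr_map
        or "display:none" in style
        or "visibility:hidden" in style
    )


class VisibleTextExtractor(HTMLParser):
    """Collect text only from elements that are not marked as hidden."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or _is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Assumed markup for the three titles in the example above.
sample = (
    "<h1>Article 1</h1><h1>Article 2</h1>"
    '<h1 style="display: none">Bomb!</h1>'
)
print(visible_text(sample))
```

This is a narrow filter and assumes well-formed markup with matching closing tags. It does not address prompt-injection text that is visible on the page.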

running in colab notebook fails with err

result = smart_scraper_graph.run()
the line above results in the error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-2af99396d10b> in <cell line: 1>()
----> 1 result = smart_scraper_graph.run()

5 frames
/usr/lib/python3.10/asyncio/runners.py in run(main, debug)
     31     """
     32     if events._get_running_loop() is not None:
---> 33         raise RuntimeError(
     34             "asyncio.run() cannot be called from a running event loop")
     35 

RuntimeError: asyncio.run() cannot be called from a running event loop
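
The error arises because a notebook cell already runs inside an event loop, and `asyncio.run()` refuses to start a second one in the same thread. A common notebook workaround is the third-party `nest_asyncio` package. The sketch below shows a standard-library alternative, under the assumption that the blocking call can be wrapped: it runs the coroutine in a helper thread that owns its own loop. `run_coroutine_blocking` and `fake_fetch` are hypothetical names, not part of the library.

```python
import asyncio
import threading


def run_coroutine_blocking(coro):
    """Run `coro` to completion even if an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running (as in a notebook cell): run the coroutine
    # in a helper thread that creates its own fresh event loop.
    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fake_fetch():
    # Stand-in for the library's internal async work (hypothetical).
    await asyncio.sleep(0)
    return "<html></html>"


print(run_coroutine_blocking(fake_fetch()))
```

Note that the calling thread blocks until the helper thread finishes, so the notebook's own loop makes no progress in the meantime.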

Allow using a remote Ollama

People don't always run apps on the same computer where they run Ollama.
Since you provide a docker-compose.yml to run ollama, you might run your app in another container.

Allow people to set base_url to configure ChatOllama to connect to a different URL than http://localhost:11434 .

AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Describe the bug
I followed the instructions and ran the example, and I am getting the following error.

I have verified that Ollama is running and that I can use the models through ollama-webui.

I am using the following models, which are downloaded and working in the web UI:

ollama/gemma:2b
ollama/nomic-embed-text

Traceback (most recent call last):
  File "/home/leo/dev/projs/scrapegraph/main.py", line 16, in <module>
    smart_scraper_graph = SmartScraperGraph(
                          ^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 24, in __init__
    super().__init__(prompt, config, source)
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py", line 25, in __init__
    self.graph = self._create_graph()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 39, in _create_graph
    node_config={"chunk_size": self.model_token}
                               ^^^^^^^^^^^^^^^^
AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Process finished with exit code 1

To Reproduce
Steps to reproduce the behavior:

Run Ollama using docker
copy paste the example
install scrapegraphai
See error

Expected behavior
It should work seamlessly

Desktop (please complete the following information):

OS: Ubuntu 23.10

add security policy

Make the tests

In the section tests

add save to csv function

create Docusaurus implementation

add a new http request method

merging the branches

Add tests for the nodes and graphs

Library components interaction

Is your feature request related to a problem? Please describe.
It is difficult to understand how the library works under the hood and how it is structured.

Describe the solution you'd like
A diagram showing the main modules contained in the library and how they interact with each other.
It should also show the different types of graphs, nodes, etc.

The job post does not contain any body content.

Describe the bug
I'm trying to scrape the content of a page, asking to summarize the job post. What I get as answer is the following:
"The job post does not contain any body content."

To Reproduce
Scrape for example the following page

Expected behavior
Getting a summary of the page content.

Fix releases in github

To make the current release clear to users, I suggest implementing a GitHub Action for releases.

Basically move from that:

To that:

Scrape JSON files

Is your feature request related to a problem? Please describe.
Would like to get information from a JSON file / string.

Describe the solution you'd like
I want to be able to use scraping graphs like SmartScraper with json input. Maybe creating a new graph or node

remove pedantic_class.py

blockScraper implementation

Is your feature request related to a problem? Please describe.
A scraper pipeline capable of retrieving all the similar blocks in a page, as found on e-commerce, weather, and flight websites.

Describe the solution you'd like
I have found this paper: https://www.researchgate.net/publication/261360247_A_Web_Page_Segmentation_Approach_Using_Visual_Semantics
It deals specifically with this issue.

Describe alternatives you've considered
nope

Additional context

Add automatic build of the read docs

it should run sphinx-apidoc -o docs/source/modules yosoai -f

Fix XML example

Describe the bug
The XML example file parses the input as if it were HTML content. The problem is related to the remover() function inside the fetch_node class.

To Reproduce
Steps to reproduce the behavior:

Run examples/openai/scrape_xml_openai.py
Warning:

--- Executing Fetch Node ---
C:\Python\Python311\Lib\html\parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
k = self.parse_starttag(i)
--- Executing Parse Node ---

Output:

{'books': [{'title': "XML Developer's Guide", 'author': 'Unknown', 'genre': 'Unknown'}]}

Expected behavior
Do not raise exceptions

Screenshots
No screenshots

Desktop (please complete the following information):

OS: Windows
Library Version: 0.4.1

Additional context
No additional context

OpenAI Error 429: Rate limit exceeded

nvm

Hugging Face API

Would it be possible to add the Hugging Face API to access many different models?

Ollama embedding model is always `llama2`

You don't configure OllamaEmbeddings in rag_node.py, therefore it is always defined with the default settings.

https://github.com/VinciGit00/Scrapegraph-ai/blob/8ef49a03e007428eea91c9bd4ee69e27256c1be1/scrapegraphai/nodes/rag_node.py#L94

Default settings: https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html#langchain_community.embeddings.ollama.OllamaEmbeddings

I guess you didn't notice because you started with llama2.

Add getState() function to Abstract Graph

Return the complete final state dictionary of the graph.
So, for example it can be used as:

final_state = smart_scraper.getState()

It could also accept an additional argument to directly specify the key you want from the state:

urls = smart_scraper.getState(key="urls")
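
A minimal sketch of the requested accessor, assuming the graph keeps its result in a `final_state` dictionary as `SmartScraperGraph.run()` appears to do. `GraphWithState` is a hypothetical stand-in, the state contents are made up, and the method is spelled `get_state` here only to follow Python naming conventions.

```python
class GraphWithState:
    """Hypothetical stand-in for a graph class that stores its final state."""

    def __init__(self):
        self.final_state = None

    def run(self):
        # Made-up state for illustration; a real graph would fill this in.
        self.final_state = {
            "answer": {"title": "Example"},
            "urls": ["https://example.com"],
        }
        return self.final_state["answer"]

    def get_state(self, key=None):
        """Return the whole final state, or a single entry when `key` is given."""
        if self.final_state is None:
            raise RuntimeError("run() must be called before get_state()")
        if key is None:
            return self.final_state
        return self.final_state[key]


graph = GraphWithState()
graph.run()
print(graph.get_state(key="urls"))
```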

add an asynchronous way to write the file

Update contributing.md with new policy ci/cd

Add the new CI/CD workflow to contributing.md.

People who want to contribute have to create a branch from pre/beta and, once development is done, merge it into pre/beta using semantic-release keywords.

Set up documentation page

Set-up the documentation using Sphinx and some theme like this

Add verbosity flag to remove print statements

Is your feature request related to a problem? Please describe.
The print statements make the outputs too verbose - it would be useful to optionally disable them using a flag

Describe the solution you'd like
Add an optional flag in the node classes to remove print statements
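
One way this could look, as a sketch rather than the library's actual node API: the node reads a `verbose` entry from its config dictionary and prints progress only when it is set. `ExampleNode` and its constructor signature are assumptions for illustration.

```python
class ExampleNode:
    """Hypothetical node that prints progress only when verbose is enabled."""

    def __init__(self, node_name, node_config=None):
        self.node_name = node_name
        # Default to quiet output when the flag is not supplied.
        self.verbose = (node_config or {}).get("verbose", False)

    def execute(self, state):
        if self.verbose:
            print(f"--- Executing {self.node_name} Node ---")
        return state


quiet_node = ExampleNode("Fetch")
loud_node = ExampleNode("Fetch", {"verbose": True})
quiet_node.execute({})  # prints nothing
loud_node.execute({})   # prints the progress banner
```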

Ollama embeddings not using base_url

I am trying the example code given with Ollama.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
        "model_tokens": 4000,
    },
    "embeddings": {
        "model": "ollama/mxbai-embed-large",
        "temperature": 0,
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
    }
}

However, it appears that the base_url is completely ignored for the embeddings.

I was able to run it with the ollama/nomic-embed-text embedding model (which is on my local computer). If I choose any other embedding model, such as ollama/snowflake-arctic-embed:latest or ollama/mxbai-embed-large, which are available on my server (on 192.168.2.162), I get the following error (note that llama3 is not available on my local computer and is correctly being used from the base_url):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 15
      9 for source in sources:
     10     smart_scraper_graph = SmartScraperGraph(
     11         prompt="Get the Title, Place of Publication, Publisher, Dates of publication, Frequency, and Notes. Don't return anything else.",
     12         source=source,
     13         config=graph_config
     14     )
---> 15     result = smart_scraper_graph.run()
     16     results.append(result)

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py:74, in SmartScraperGraph.run(self)
     70 """
     71 Executes the web scraping process and returns the answer to the prompt.
     72 """
     73 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
---> 74 self.final_state, self.execution_info = self.graph.execute(inputs)
     76 return self.final_state.get("answer", "No answer found.")

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\base_graph.py:83, in BaseGraph.execute(self, initial_state)
     80 current_node = self.nodes[current_node_name]
     82 with get_openai_callback() as cb:
---> 83     result = current_node.execute(state)
     85     node_exec_time = time.time() - curr_time
     86     total_exec_time += node_exec_time

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\nodes\rag_node.py:101, in RAGNode.execute(self, state)
     98 else:
     99     raise ValueError("Embedding Model missing or not supported")
--> 101 retriever = FAISS.from_documents(
    102     chunked_docs, embeddings).as_retriever()
    104 redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
    105 # similarity_threshold could be set, now k=20

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_core\vectorstores.py:550, in VectorStore.from_documents(cls, documents, embedding, **kwargs)
    548 texts = [d.page_content for d in documents]
    549 metadatas = [d.metadata for d in documents]
--> 550 return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\vectorstores\faiss.py:930, in FAISS.from_texts(cls, texts, embedding, metadatas, ids, **kwargs)
    903 @classmethod
    904 def from_texts(
    905     cls,
   (...)
    910     **kwargs: Any,
    911 ) -> FAISS:
    912     """Construct FAISS wrapper from raw documents.
    913 
    914     This is a user friendly interface that:
   (...)
    928             faiss = FAISS.from_texts(texts, embeddings)
    929     """
--> 930     embeddings = embedding.embed_documents(texts)
    931     return cls.__from(
    932         texts,
    933         embeddings,
   (...)
    937         **kwargs,
    938     )

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:211, in OllamaEmbeddings.embed_documents(self, texts)
    202 """Embed documents using an Ollama deployed embedding model.
    203 
    204 Args:
   (...)
    208     List of embeddings, one for each text.
    209 """
    210 instruction_pairs = [f"{self.embed_instruction}{text}" for text in texts]
--> 211 embeddings = self._embed(instruction_pairs)
    212 return embeddings

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in OllamaEmbeddings._embed(self, input)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in <listcomp>(.0)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:173, in OllamaEmbeddings._process_emb_response(self, input)
    170     raise ValueError(f"Error raised by inference endpoint: {e}")
    172 if res.status_code != 200:
--> 173     raise ValueError(
    174         "Error raised by inference API HTTP code: %s, %s"
    175         % (res.status_code, res.text)
    176     )
    177 try:
    178     t = res.json()

ValueError: Error raised by inference API HTTP code: 404, {"error":"model 'mxbai-embed-large' not found, try pulling it first"}

To confirm my suspicion, I pulled mxbai-embed-large on my local computer and then I was able to run the same.
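
The behaviour described is consistent with the embedder being constructed without the configured URL, so it falls back to its localhost default. As a sketch of what forwarding the setting might look like, the hypothetical helper below builds keyword arguments from the "embeddings" section of the config. It assumes the embedder is LangChain's `OllamaEmbeddings` (the class named in the traceback), which accepts `model` and `base_url` keyword arguments.

```python
def ollama_embedder_kwargs(embedder_config):
    """Build keyword arguments for an Ollama embedder from the config section.

    Hypothetical helper: strips the "ollama/" provider prefix from the model
    name and forwards base_url only when the user supplied one.
    """
    model = embedder_config["model"]
    if model.startswith("ollama/"):
        model = model[len("ollama/"):]
    kwargs = {"model": model}
    if "base_url" in embedder_config:
        kwargs["base_url"] = embedder_config["base_url"]
    return kwargs


kwargs = ollama_embedder_kwargs({
    "model": "ollama/mxbai-embed-large",
    "base_url": "http://192.168.2.162:11434",
})
print(kwargs)
# The result could then be passed on, e.g. OllamaEmbeddings(**kwargs).
```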

Convert Graphviz Diagrams to Scraping Class Structures

Description:

Building on our graph visualization capabilities, we aim to develop a feature that allows the creation of scraping class structures from Graphviz diagrams. This will enable rapid prototyping of scraping workflows from visual designs.

Tasks:

Define a standard format or conventions for Graphviz diagrams representing scraping workflows;
Implement a parser that reads Graphviz diagram files and extracts node and edge information;
Develop logic to generate Python class definitions for nodes and graphs based on the parsed diagram.
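
To give a feel for the parsing task, here is a deliberately small sketch that extracts nodes and directed edges from DOT text with a regular expression. It assumes a restricted convention (one `a -> b` edge per statement, simple identifiers, no attributes or subgraphs). A real implementation would more likely rely on a proper DOT parser. `parse_dot_edges` is a hypothetical name.

```python
import re

# Matches `source -> target`, with or without double quotes around the names.
EDGE_PATTERN = re.compile(r'"?([\w-]+)"?\s*->\s*"?([\w-]+)"?')


def parse_dot_edges(dot_text):
    """Return (nodes, edges) found in a simple DOT description."""
    edges = EDGE_PATTERN.findall(dot_text)
    nodes = []
    for source, target in edges:
        for name in (source, target):
            if name not in nodes:  # keep first-seen order, no duplicates
                nodes.append(name)
    return nodes, edges


nodes, edges = parse_dot_edges("digraph { fetch -> parse; parse -> rag; }")
print(nodes)
print(edges)
```

The node and edge lists could then drive whatever code generation step turns them into node and graph classes.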

Please add examples to the documentation, struggling with understanding how it works

Wonderful! I was looking for something like this library!

But I found it difficult to use the library because there are no examples.

For instance, I need to extract the name, the province and the address of all cinema theaters listed on this website:

PROVINCES
https://www.mymovies.it/cinema/

The information is split across 3 levels: the page above contains the list of provinces, and each province (city) links to the names of its theaters:

PROVINCES->THEATERS
https://www.mymovies.it/cinema/roma/

And each theatre name links to the address:

PROVINCES->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/5157/

Sometimes there is a 4th level for smaller towns inside the province area of the city:

PROVINCES->SMALL_TOWNS
https://www.mymovies.it/cinema/roma/fianoromano/

PROVINCES->SMALL_TOWNS->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/fianoromano/5102/

How am I supposed to ask for that and get back the list of all Italian cinema theatre names and addresses, divided by province?

Thanks in advance!

Allow setting Ollama models context length within config

Many people don't use the bare model name (mistral) and pull a specific tag (mistral:7b-instruct-v0.2-q6_K).

Allow the user to set a context length in the config if it is not found in helpers/models_tokens.py.
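
A sketch of the requested lookup order, assuming a `model_tokens` key in the "llm" config section and a nested `models_tokens` table keyed by provider and model name. The table contents and the default value below are made up for illustration, and `resolve_model_tokens` is a hypothetical helper.

```python
DEFAULT_TOKENS = 8192  # assumed fallback, not a value taken from the library


def resolve_model_tokens(llm_config, models_tokens, default=DEFAULT_TOKENS):
    """Pick a context length: explicit config first, then the table, then a default."""
    if "model_tokens" in llm_config:
        return llm_config["model_tokens"]
    # "ollama/mistral:7b" -> provider "ollama", model name "mistral:7b"
    provider, _, model_name = llm_config["model"].partition("/")
    try:
        return models_tokens[provider][model_name]
    except KeyError:
        return default


example_table = {"ollama": {"mistral": 8192}}  # illustrative values only
tagged = {"model": "ollama/mistral:7b-instruct-v0.2-q6_K", "model_tokens": 4000}
print(resolve_model_tokens(tagged, example_table))
```

With this order, a tagged model that is missing from the table no longer fails with a KeyError. It uses the user's value, or the default if none was given.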

Create Scraping Graphs Based on User Prompt

Description:

To simplify the creation of LLM-powered scraping graphs, we propose the development of a Graph Builder class. This class will abstract the complexity of graph construction, allowing users to create powerful scraping workflows using only a natural language prompt and selecting from available nodes in the library.

Tasks:

Design the Graph Builder class interface, focusing on ease of use and flexibility;
Implement logic to interpret the prompt and select appropriate nodes and configurations;
Integrate with the existing LLM setup to ensure seamless generation of answers and interactions within the graph.

Change the algorithm for zipping data before the request

the function scraper inside AmazScraper/utils/getter.py should be fixed

ModuleNotFoundError: No module named 'fp'

Describe the bug

I ran https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_openai.py and got the error: ModuleNotFoundError: No module named 'fp'

To Reproduce
I ran the following example and got the error:

"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))


D:\Application\Conda\envs\agentkit\python.exe D:\Projects\githubhub\agentkit\tmp.py 
Traceback (most recent call last):
  File "D:\Projects\githubhub\agentkit\tmp.py", line 7, in <module>
    from scrapegraphai.graphs import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\__init__.py", line 5, in <module>
    from .smart_scraper_graph import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 5, in <module>
    from ..nodes import (
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\__init__.py", line 5, in <module>
    from .fetch_node import FetchNode
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\fetch_node.py", line 9, in <module>
    from ..utils.remover import remover
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\__init__.py", line 8, in <module>
    from .proxy_rotation import proxy_generator
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\proxy_rotation.py", line 4, in <module>
    from fp.fp import FreeProxy
ModuleNotFoundError: No module named 'fp'
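
The traceback shows that importing anything from the package pulls in proxy_rotation.py, which imports `fp` unconditionally. One way to make such a dependency optional, sketched here with a hypothetical `optional_import` helper, is to import it lazily and fail with an actionable message. To my knowledge the `fp` module is distributed on PyPI as `free-proxy`, but treat that package name as an assumption to verify.

```python
import importlib


def optional_import(module_name, pip_name):
    """Import a module on demand, with an install hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature; "
            f"try: pip install {pip_name}"
        ) from exc


# Works for any installed module; json is used here because it is always present.
json_module = optional_import("json", "n/a")
print(json_module.dumps({"ok": True}))

# Hypothetical use inside a proxy helper, evaluated only when proxies are needed:
# FreeProxy = optional_import("fp.fp", "free-proxy").FreeProxy
```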

BaseGraph does not inherit from AbstractGraph

Is this intentional?

Because BaseGraph does not inherit from AbstractGraph it does not have access to some of the useful methods such as get_execution_info().

Method to pass common params to all the nodes in the graph

Is your feature request related to a problem? Please describe.
There is no way to pass common parameters to every node in a graph at once. For example, it would be nice to have the verbose flag available in all the nodes.

What I am doing now is defining the params inside the AbstractGraph (which is inherited in all the scraping graphs) with something like:

and then passing it to each node explicitly, for example inside SmartScraperGraph:

You may understand that this method doesn't scale up well.

Describe the solution you'd like
I would like a function like this inside AbstractGraph that updates all the node_config dict for each node:

Describe alternatives you've considered
None

Additional context
None
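
The snippets from this request did not survive, so the following is only a guess at the shape of such a function: a helper that copies shared parameters into every node's `node_config` dictionary. `Node`, `set_common_params`, and the `overwrite` option are all hypothetical names for illustration.

```python
class Node:
    """Hypothetical node that carries a per-node configuration dictionary."""

    def __init__(self, name, node_config=None):
        self.name = name
        self.node_config = dict(node_config or {})


def set_common_params(nodes, params, overwrite=False):
    """Copy shared parameters into each node's node_config.

    By default, values already set on a node win over the common ones.
    """
    for node in nodes:
        for key, value in params.items():
            if overwrite or key not in node.node_config:
                node.node_config[key] = value


nodes = [Node("fetch"), Node("parse", {"verbose": False, "chunk_size": 4000})]
set_common_params(nodes, {"verbose": True})
print([node.node_config for node in nodes])
```

Keeping node-level values by default lets a single node opt out of a graph-wide setting, while `overwrite=True` forces the common value everywhere.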

Ability to use locally hosted AI Model

Is your feature request related to a problem? Please describe.
Using OpenAI's models presents known privacy problems.

Describe the solution you'd like
The ability to use local models like
TinyLlama 1.1B v1.0
OpenChat 3.5 0106

Describe alternatives you've considered
I am on the lookout

The ability to use OpenAI embeddings with Groq

Hey there! It would be great if we could use OpenAI embeddings (or any other supported API-based embedding models) with Groq (or any other supported llm). With the current way the code is organized, you can only use OpenAI embeddings with OpenAI models. If I want to use Groq as my main llm, I would have to use Ollama, which is ok if you want to run models locally. But I don't want to install models on my local machine, I would prefer to use OpenAI as my embedder service.

One way to add this is to change the way self.embedder_model is initialized in the AbstractGraph class. Currently, both self.llm_model and self.embedder_model are initialized using one method, self._create_llm(), which kinda makes our options limited. One possible solution is to add another function e.g. self._create_embedder() and completely separate the logic for initialization of llms and embedder models.

nomic-embed-text KeyError: 'Model not supported'

Describe the bug
I am trying to run the example from the readme.md file using ollama but the following error is raised:

    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'

To Reproduce
Steps to reproduce the behavior:

Install Ollama and Scrapegraph-ai
Create a script with the following code:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

See error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:230, in AbstractGraph._create_embedder(self, embedder_config)
    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[1], line 16
      1 from scrapegraphai.graphs import SmartScraperGraph
      3 graph_config = {
      4     "llm": {
      5         "model": "ollama/llama3",
   (...)
     13     }
     14 }
---> 16 smart_scraper_graph = SmartScraperGraph(
     17     prompt="List me all the articles",
     18     # also accepts a string with the already downloaded HTML code
     19     source="https://perinim.github.io/projects",
     20     config=graph_config
     21 )
     23 result = smart_scraper_graph.run()
     24 print(result)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py:47, in SmartScraperGraph.__init__(self, prompt, source, config)
     46 def __init__(self, prompt: str, source: str, config: dict):
---> 47     super().__init__(prompt, config, source)
     49     self.input_key = "url" if source.startswith("http") else "local_dir"

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:51, in AbstractGraph.__init__(self, prompt, config, source)
     48 self.config = config
     49 self.llm_model = self._create_llm(config["llm"], chat=True)
     50 self.embedder_model = self._create_default_embedder(    
---> 51     ) if "embeddings" not in config else self._create_embedder(
     52     config["embeddings"])
     54 # Set common configuration parameters
     55 self.verbose = True if config is None else config.get("verbose", False)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:232, in AbstractGraph._create_embedder(self, embedder_config)
    230         models_tokens["ollama"][embedder_config["model"]]
    231     except KeyError:
--> 232         raise KeyError("Model not supported")
    233     return OllamaEmbeddings(**embedder_config)
    235 elif "hugging_face" in embedder_config["model"]:

KeyError: 'Model not supported'

Expected behavior
I wanted to run the example

Desktop (please complete the following information):

OS: macOS 14.31.
Browser: Chrome

Additional context
Ollama with llama3 and nomic-embed-text are working ok outside this project.
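The traceback suggests the model name (with the "ollama/" prefix already removed) is looked up in a table of known models, and any miss is re-raised as "Model not supported". A minimal sketch of that pattern follows. The table contents, the token count and the helper names are assumptions for illustration, not the library's actual data.

```python
# Hypothetical excerpt of a models_tokens table; the real contents are not shown in the issue.
models_tokens = {"ollama": {"llama3": 8192}}


def check_ollama_model(model, table=models_tokens):
    """Strip the 'ollama/' prefix and verify that the model is listed."""
    name = model.split("/", 1)[-1]
    try:
        table["ollama"][name]
    except KeyError:
        raise KeyError("Model not supported")
    return name


def try_model(model):
    """Return the accepted model name, or a short message if it is rejected."""
    try:
        return check_ollama_model(model)
    except KeyError as exc:
        return f"rejected: {exc.args[0]}"


results = [try_model("ollama/llama3"), try_model("ollama/nomic-embed-text")]
```

In this sketch an embedding model that Ollama serves correctly is still rejected simply because it has no entry in the table, which matches the behaviour reported above.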

Create for, while and switch nodes

Create a new folder outside nodes, called conditional nodes, and add some new nodes that choose a specific output given an input:

for node: given an input, iterates over the nodes and returns an output
switch/if node: given a state and a condition, evaluates the condition and returns the name (string) of the next node
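One possible reading of these two nodes, written as plain functions. The names switch_node and for_node, their signatures and the example state are hypothetical; a real implementation would presumably follow the project's node base class.

```python
def switch_node(state, condition, if_true, if_false):
    """Return the name of the next node, chosen by evaluating condition on the state."""
    return if_true if condition(state) else if_false


def for_node(items, body):
    """Run body (a callable standing in for a node) on each item and collect the outputs."""
    return [body(item) for item in items]


state = {"links": ["a", "b"], "depth": 0}
next_node = switch_node(
    state,
    lambda s: len(s["links"]) > 0,
    if_true="fetch_links",
    if_false="generate_answer",
)
outputs = for_node(state["links"], str.upper)
```

Returning the next node's name as a string keeps the switch node independent of the graph object, so the graph runner can resolve the name to a node itself.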

Groq model implementation

Is your feature request related to a problem? Please describe.
Groq is not supported

Describe the solution you'd like
Add Groq to the list of available models for Scrapegraph

Describe alternatives you've considered
none

Additional context
Issue from discord

Update Readthedocs Documentation

Update the documentation hosted on Readthedocs.
Sphinx is used to generate the documentation.
Files can be modified and added inside the docs/source folder of the repo.

Support Claude3 haiku and others using litellm

Claude 3 Haiku is cheaper and better than GPT-3.5. Support for it, and for many other models and platforms, could easily be added by using litellm.

insufficient_quota

Describe the bug
I keep getting insufficient_quota when testing with Google Colab.

To Reproduce
run the colab example

Expected behavior
It is probably being rate limited because it calls the OpenAI API too quickly.

Screenshots
run the colab example

Additional context
I think it would work fine if the smart_scraper_graph.run() call accepted a parameter to adjust the rate at which it accesses the OpenAI API.

Another approach would be to support some hosted open LLMs, like together.ai llama-3-70b, which may not have this issue.
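The rate parameter suggested above could be approximated by a small throttle around whatever function calls the model. This is a sketch only: Throttle and throttled are hypothetical names, nothing here is part of scrapegraphai or the OpenAI client, and whether throttling actually resolves insufficient_quota is not confirmed in this issue.

```python
import time


class Throttle:
    """Allow at most one call every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def throttled(func, throttle):
    """Wrap func so that each call first waits for the throttle."""
    def wrapper(*args, **kwargs):
        throttle.wait()
        return func(*args, **kwargs)
    return wrapper


# Demo with a fake clock so the example runs instantly.
fake_time = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


call_model = throttled(
    lambda prompt: prompt.upper(),  # stand-in for a real API call
    Throttle(2.0, clock=lambda: fake_time[0], sleep=fake_sleep),
)
answers = [call_model("a"), call_model("b")]
```

In real use the clock and sleep arguments would be left at their defaults, so the second call simply pauses until the interval has elapsed.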

Azure Example

Hi there, love the idea of the package! I'm looking to use my Azure instance to run this and I can't seem to find any examples of how to do it in the docs and haven't been able to get something working.

This is the starting point of my Azure-based LLM usage. How can I go from this to using this package with one of the models I have hosted there?

Thanks in advance.


from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=my_endpoint,
    api_key=my_api_key,
    api_version=my_api_version
)

message_text = [{"role":"system","content":"You are an AI assistant that helps people find information."}]

completion = client.chat.completions.create(
    model="model_name",
    messages=message_text,
    # ... (remaining parameters elided)
)

response_str = completion.choices[0].message.content.strip()

Scraping n levels deep

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n-levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links in example.com and scrape those too

Describe alternatives you've considered
I can use BeautifulSoup and download the pages and then feed them to this
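A minimal sketch of the n-levels-deep idea, using only the standard library instead of BeautifulSoup. The crawl function, its fetch parameter and the in-memory site are assumptions for illustration; in practice fetch would download the page, and each returned HTML string could then be passed to the scraper as its source.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth):
    """Breadth-first crawl; returns {url: html} for same-host pages up to max_depth links away."""
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Demo with an in-memory "site" instead of real HTTP requests.
site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">ext</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": "leaf",
}
pages = crawl("https://example.com/", site.__getitem__, max_depth=1)
```

Restricting the crawl to the starting host and tracking visited URLs keeps the number of pages bounded; both are design choices of this sketch rather than part of the request.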
