
scrapegraph-ai's Introduction

πŸ•·οΈ ScrapeGraphAI: You Only Scrape Once


ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

Just say which information you want to extract and the library will do it for you!


🚀 Quick install

The reference page for Scrapegraph-ai is available on PyPI.

pip install scrapegraphai

You will also need to install Playwright for JavaScript-based scraping:

playwright install

Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

πŸ” Demo

Official streamlit demo:

My Skills

Try it directly on the web using Google Colab:

Open In Colab

📖 Documentation

The documentation for ScrapeGraphAI can be found here.

Also check out the Docusaurus here.

💻 Usage

There are three main scraping pipelines that can be used to extract information from a website (or local file):

  • SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
  • SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
  • SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.

It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.

Case 1: SmartScraper using Local Models

Remember to have Ollama installed and download the models using the ollama pull command.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

The output will be a list of projects with their descriptions like the following:

{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}

Case 2: SearchGraph using Mixed Models

We use Groq for the LLM and Ollama for the embeddings.

from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)

The output will be a list of recipes like the following:

{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}

Case 3: SpeechGraph using OpenAI

You just need to pass the OpenAI API key and the model name.

from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)

The output will be an audio file with the summary of the projects on the page.

🤝 Contributing

Feel free to contribute and join our Discord server to discuss improvements and share suggestions with us!

Please see the contributing guidelines.


📈 Roadmap

Check out the project roadmap here! 🚀

Wanna visualize the roadmap in a more interactive way? Check out the markmap visualization by copy-pasting the markdown content into the editor!

❤️ Contributors


Sponsors

SerpAPI

🎓 Citations

If you have used our library for research purposes, please cite us with the following reference:

  @misc{scrapegraph-ai,
    author = {Marco Perini and Lorenzo Padoan and Marco Vinciguerra},
    title = {Scrapegraph-ai},
    year = {2024},
    url = {https://github.com/VinciGit00/Scrapegraph-ai},
    note = {A Python library for scraping leveraging large language models}
  }

Authors


Contact Info
Marco Vinciguerra (LinkedIn)
Marco Perini (LinkedIn)
Lorenzo Padoan (LinkedIn)

📜 License

ScrapeGraphAI is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgements

  • We would like to thank all the contributors to the project and the open-source community for their support.
  • ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.

scrapegraph-ai's People

Contributors

andrearoota, arjuuuuunnnnn, arsaboo, cemkod, daniele-roncaglioni, darvat, dependabot[bot], dito97, dpende, eltociear, epage480, erjanmx, f-aguzzi, fattimei, ftoppi, jgalego, kahwoo, kpcofgs, lurenss, mayurdb, mudler, perinim, s4mpl3r, semantic-release-bot, shkamboj1, spulci, vedovati-matteo, vincigit00

scrapegraph-ai's Issues

The ability to use OpenAI embeddings with Groq

Hey there! It would be great if we could use OpenAI embeddings (or any other supported API-based embedding models) with Groq (or any other supported llm). With the current way the code is organized, you can only use OpenAI embeddings with OpenAI models. If I want to use Groq as my main llm, I would have to use Ollama, which is ok if you want to run models locally. But I don't want to install models on my local machine, I would prefer to use OpenAI as my embedder service.

One way to add this is to change the way self.embedder_model is initialized in the AbstractGraph class. Currently, both self.llm_model and self.embedder_model are initialized using one method, self._create_llm(), which kinda makes our options limited. One possible solution is to add another function e.g. self._create_embedder() and completely separate the logic for initialization of llms and embedder models.
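The separation proposed above might look roughly like the following sketch. Only the method names `_create_llm` and `_create_embedder` come from the proposal; the `GraphConfigSketch` class, the `split_model_name` helper, the returned dictionaries and the example embedding model name are hypothetical stand-ins, not the library's actual code.

```python
def split_model_name(model: str) -> tuple[str, str]:
    """Split 'provider/name' into (provider, name). Hypothetical helper."""
    provider, _, name = model.partition("/")
    if not name:
        # No provider prefix given: this sketch assumes OpenAI in that case.
        return "openai", provider
    return provider, name


class GraphConfigSketch:
    """Hypothetical stand-in for AbstractGraph with two independent factories."""

    def __init__(self, config: dict):
        self.llm_model = self._create_llm(config["llm"])
        # Reuse the LLM settings only when no "embeddings" section is configured.
        self.embedder_model = self._create_embedder(
            config.get("embeddings", config["llm"])
        )

    def _create_llm(self, llm_config: dict) -> dict:
        provider, name = split_model_name(llm_config["model"])
        return {"kind": "llm", "provider": provider, "model": name}

    def _create_embedder(self, embedder_config: dict) -> dict:
        provider, name = split_model_name(embedder_config["model"])
        return {"kind": "embedder", "provider": provider, "model": name}


# Groq as the LLM, an API-based embedder from a different provider.
graph = GraphConfigSketch(
    {
        "llm": {"model": "groq/gemma-7b-it"},
        "embeddings": {"model": "openai/text-embedding-3-small"},
    }
)
print(graph.llm_model["provider"], graph.embedder_model["provider"])
```

Because each factory only looks at its own config section, the LLM provider and the embedder provider no longer have to match.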

How to use Azure portal PAT to scrape the data

I use the Azure DevOps portal for managing bug data, which I currently access using a Personal Access Token (PAT) and retrieve data through a Python script via APIs. Is it possible to integrate Scrapegraph-AI for this use case?

Ollama embeddings not using base_url

I am trying the example code given with Ollama.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
        "model_tokens": 4000,
    },
    "embeddings": {
        "model": "ollama/mxbai-embed-large",
        "temperature": 0,
        "base_url": "http://192.168.2.162:11434",  # set Ollama URL arbitrarily
    }
}

However, it appears that the base_url is completely ignored for the embeddings.

I was able to run it with the ollama/nomic-embed-text embedding model (which is on my local computer). If I choose any other embedding model, such as ollama/snowflake-arctic-embed:latest or ollama/mxbai-embed-large, which are available on my server (on 192.168.2.162), I get the following error (note that llama3 is not available on my local computer and is correctly being used from the base_url):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 15
      9 for source in sources:
     10     smart_scraper_graph = SmartScraperGraph(
     11         prompt="Get the Title, Place of Publication, Publisher, Dates of publication, Frequency, and Notes. Don't return anything else.",
     12         source=source,
     13         config=graph_config
     14     )
---> 15     result = smart_scraper_graph.run()
     16     results.append(result)

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py:74, in SmartScraperGraph.run(self)
     70 """
     71 Executes the web scraping process and returns the answer to the prompt.
     72 """
     73 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
---> 74 self.final_state, self.execution_info = self.graph.execute(inputs)
     76 return self.final_state.get("answer", "No answer found.")

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\graphs\base_graph.py:83, in BaseGraph.execute(self, initial_state)
     80 current_node = self.nodes[current_node_name]
     82 with get_openai_callback() as cb:
---> 83     result = current_node.execute(state)
     85     node_exec_time = time.time() - curr_time
     86     total_exec_time += node_exec_time

File c:\Users\arsab\anaconda3\lib\site-packages\scrapegraphai\nodes\rag_node.py:101, in RAGNode.execute(self, state)
     98 else:
     99     raise ValueError("Embedding Model missing or not supported")
--> 101 retriever = FAISS.from_documents(
    102     chunked_docs, embeddings).as_retriever()
    104 redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
    105 # similarity_threshold could be set, now k=20

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_core\vectorstores.py:550, in VectorStore.from_documents(cls, documents, embedding, **kwargs)
    548 texts = [d.page_content for d in documents]
    549 metadatas = [d.metadata for d in documents]
--> 550 return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\vectorstores\faiss.py:930, in FAISS.from_texts(cls, texts, embedding, metadatas, ids, **kwargs)
    903 @classmethod
    904 def from_texts(
    905     cls,
   (...)
    910     **kwargs: Any,
    911 ) -> FAISS:
    912     """Construct FAISS wrapper from raw documents.
    913 
    914     This is a user friendly interface that:
   (...)
    928             faiss = FAISS.from_texts(texts, embeddings)
    929     """
--> 930     embeddings = embedding.embed_documents(texts)
    931     return cls.__from(
    932         texts,
    933         embeddings,
   (...)
    937         **kwargs,
    938     )

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:211, in OllamaEmbeddings.embed_documents(self, texts)
    202 """Embed documents using an Ollama deployed embedding model.
    203 
    204 Args:
   (...)
    208     List of embeddings, one for each text.
    209 """
    210 instruction_pairs = [f"{self.embed_instruction}{text}" for text in texts]
--> 211 embeddings = self._embed(instruction_pairs)
    212 return embeddings

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in OllamaEmbeddings._embed(self, input)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:199, in <listcomp>(.0)
    197 else:
    198     iter_ = input
--> 199 return [self._process_emb_response(prompt) for prompt in iter_]

File c:\Users\arsab\anaconda3\lib\site-packages\langchain_community\embeddings\ollama.py:173, in OllamaEmbeddings._process_emb_response(self, input)
    170     raise ValueError(f"Error raised by inference endpoint: {e}")
    172 if res.status_code != 200:
--> 173     raise ValueError(
    174         "Error raised by inference API HTTP code: %s, %s"
    175         % (res.status_code, res.text)
    176     )
    177 try:
    178     t = res.json()

ValueError: Error raised by inference API HTTP code: 404, {"error":"model 'mxbai-embed-large' not found, try pulling it first"}

To confirm my suspicion, I pulled mxbai-embed-large on my local computer and then I was able to run the same.
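The behaviour described above is consistent with the embeddings `base_url` not reaching the embedding client. A minimal sketch of the forwarding step is shown below; `build_ollama_embedder_kwargs` is a hypothetical helper and not part of the library. The assumption is that the resulting dictionary would be passed to the `OllamaEmbeddings` class seen in the traceback, which accepts `model` and `base_url` arguments.

```python
def build_ollama_embedder_kwargs(embedder_config: dict) -> dict:
    """Translate an "embeddings" config section into embedder keyword arguments.

    Hypothetical helper: strips the "ollama/" prefix from the model name and
    forwards base_url when it is present, so a remote server is not ignored.
    """
    model = embedder_config["model"]
    if model.startswith("ollama/"):
        model = model[len("ollama/"):]

    kwargs = {"model": model}
    if "base_url" in embedder_config:
        kwargs["base_url"] = embedder_config["base_url"]
    return kwargs


kwargs = build_ollama_embedder_kwargs(
    {
        "model": "ollama/mxbai-embed-large",
        "temperature": 0,
        "base_url": "http://192.168.2.162:11434",
    }
)
print(kwargs)
# Assumed usage (not executed here): OllamaEmbeddings(**kwargs)
```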

Scrape JSON files

Is your feature request related to a problem? Please describe.
Would like to get information from a JSON file / string.

Describe the solution you'd like
I want to be able to use scraping graphs like SmartScraper with JSON input, maybe by creating a new graph or node.

Library components interaction

Is your feature request related to a problem? Please describe.
It is difficult to understand how the library works under the hood and how it is structured.

Describe the solution you'd like
A diagram showing the main modules contained in the library and how they interact with each other,
including the different types of graphs and nodes.

Create Scraping Graphs Based on User Prompt

Description:

To simplify the creation of LLM-powered scraping graphs, we propose the development of a Graph Builder class. This class will abstract the complexity of graph construction, allowing users to create powerful scraping workflows using only a natural language prompt and selecting from available nodes in the library.

Tasks:

  • Design the Graph Builder class interface, focusing on ease of use and flexibility;
  • Implement logic to interpret the prompt and select appropriate nodes and configurations;
  • Integrate with the existing LLM setup to ensure seamless generation of answers and interactions within the graph.

Scraping n levels deep

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n-levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links in example.com and scrape those too

Describe alternatives you've considered
I can use BeautifulSoup and download the pages and then feed them to this
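The requested behaviour can be sketched as a breadth-first crawl limited to a maximum depth. The version below uses only the standard library rather than BeautifulSoup; `crawl`, `fetch_html` and `LinkCollector` are hypothetical names, and restricting the crawl to the starting host is an assumption made here to keep it bounded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url: str) -> str:
    """Download a page as text (no JavaScript rendering)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url: str, max_depth: int, fetch=fetch_html) -> dict:
    """Return {url: html} for start_url and same-host pages up to max_depth links away."""
    start_host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # do not follow links beyond the requested depth

        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href).split("#", 1)[0]  # resolve and drop fragments
            if urlparse(link).netloc == start_host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Since the README notes that `source` also accepts a string with already downloaded HTML, each value in the returned dictionary could then be passed to a SmartScraperGraph in turn.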

Implement Graph Visualization with Graphviz

Description:

To enhance our understanding and debugging capabilities of the scraping workflows, we aim to introduce a feature for visualizing the graph structures using Graphviz. This visualization will help in quickly identifying the flow, relationships between nodes, and potential bottlenecks in the graph.

Tasks:

  • Research Graphviz syntax and integration options with Python;
  • Develop a utility function or class method that converts our current graph structure into a Graphviz-compatible format;
  • Ensure the visualization includes node names, types, and directional edges.
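One possible shape for the conversion utility in the tasks above is a function that emits Graphviz DOT text. This is only a sketch: `to_dot` is a hypothetical name, and the graph is assumed to be available as a mapping of node names to node types plus a list of directed edges, which may not match the library's internal structure.

```python
def to_dot(nodes: dict, edges: list, graph_name: str = "scraping_graph") -> str:
    """Render node names, node types and directed edges as Graphviz DOT text.

    nodes maps a node name to its type, edges is a list of (source, target) pairs.
    """
    lines = [f"digraph {graph_name} {{"]
    for node_name, node_type in nodes.items():
        # "\n" inside a DOT label is rendered by Graphviz as a line break.
        lines.append(f'    "{node_name}" [label="{node_name}\\n({node_type})"];')
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(
    nodes={"fetch": "FetchNode", "rag": "RAGNode"},
    edges=[("fetch", "rag")],
)
print(dot)
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng graph.dot -o graph.png`.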

Update contributing.md with new policy ci/cd

Add the new CI/CD workflow to contributing.md.

People who want to contribute have to create a branch from pre/beta and, once development is done, merge it back into pre/beta using semantic-release keywords.

running in colab notebook fails with err

result = smart_scraper_graph.run()
the line above results in the error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-2af99396d10b> in <cell line: 1>()
----> 1 result = smart_scraper_graph.run()

5 frames
/usr/lib/python3.10/asyncio/runners.py in run(main, debug)
     31     """
     32     if events._get_running_loop() is not None:
---> 33         raise RuntimeError(
     34             "asyncio.run() cannot be called from a running event loop")
     35 

RuntimeError: asyncio.run() cannot be called from a running event loop
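Notebooks such as Colab already run an event loop, which is why a nested `asyncio.run()` call fails. A commonly used workaround in notebooks is the third-party `nest_asyncio` package (`nest_asyncio.apply()`). The standard-library sketch below shows another general pattern: when a loop is already running, the coroutine is executed in a separate thread that owns its own loop. `run_coroutine_blocking` and `fake_scrape` are hypothetical names, and this is not how the library itself handles the problem.

```python
import asyncio
import threading


def run_coroutine_blocking(coro):
    """Run a coroutine to completion, even when called from inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    outcome = {}

    def worker():
        # The new thread has no running loop, so asyncio.run() works here.
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # blocks the caller (and its loop) until the coroutine finishes
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fake_scrape():
    """Stand-in for an async scraping step."""
    await asyncio.sleep(0)
    return {"answer": "ok"}
```

Note that the calling loop is blocked while the worker thread runs, which is usually acceptable for a single notebook cell but not for a server.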

Hugging Face API

Would it be possible to add the Hugging Face API to access many different models?

insufficient_quota

Describe the bug
I keep getting insufficient_quota when testing with Google Colab.

To Reproduce
run the colab example

Expected behavior
it is probably rate limiting because it is trying to call openai too fast.

Screenshots
run the colab example

Additional context
I think that if smart_scraper_graph.run() accepted a parameter to adjust the rate at which it calls the OpenAI API, it should work fine.

Another approach would be to support some openly hosted LLMs, like together.ai llama-3-70b, which may not have this issue.
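Until such a parameter exists, a caller could wrap the run call in a retry loop with exponential backoff. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and the choice of which exceptions to retry on is left to the caller because the exception raised by the library is not shown in this report. Retrying only helps with transient rate limiting; if the account has no remaining quota, the retries will fail as well.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure.

    The delay before retry number k (starting at 0) is base_delay * 2**k seconds.
    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Assumed usage with a graph instance from the README examples:
# result = call_with_backoff(smart_scraper_graph.run, max_attempts=4, base_delay=2.0)
```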

AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Describe the bug
I just followed the instructions and ran the example, and I am getting the following error.

I have verified that Ollama is running and that I can use the models through ollama-webui.

I am using the following models; they are downloaded and working in the web UI:

ollama/gemma:2b
ollama/nomic-embed-text

image

Traceback (most recent call last):
  File "/home/leo/dev/projs/scrapegraph/main.py", line 16, in <module>
    smart_scraper_graph = SmartScraperGraph(
                          ^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 24, in __init__
    super().__init__(prompt, config, source)
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py", line 25, in __init__
    self.graph = self._create_graph()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/leo/dev/projs/scrapegraph/.venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 39, in _create_graph
    node_config={"chunk_size": self.model_token}
                               ^^^^^^^^^^^^^^^^
AttributeError: 'SmartScraperGraph' object has no attribute 'model_token'

Process finished with exit code 1

To Reproduce
Steps to reproduce the behavior:

  1. Run Ollama using docker
  2. copy paste the example
  3. install scrapegraphai
  4. See error

Expected behavior
It should work seamlessly

Screenshots
If applicable, add screenshots to help explain your problem.
image

Desktop (please complete the following information):

  • OS: Ubuntu 23.10

Method to pass common params to all the nodes in the graph

Is your feature request related to a problem? Please describe.
There is no way to pass a set of parameters to every node in a graph at once. For example, it would be nice to have the verbose flag available in all the nodes.

What I am doing now is defining the params inside the AbstractGraph (which is inherited in all the scraping graphs) with something like:

image

and then passing it to each node explicitly, for example inside SmartScraperGraph:

image

You may understand that this method doesn't scale up well.

Describe the solution you'd like
I would like a function like this inside AbstractGraph that updates the node_config dict of every node:

image

Describe alternatives you've considered
None

Additional context
None
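Since the screenshots from the request are not available, the idea can be illustrated with a small sketch. `NodeSketch` and `set_common_params` are hypothetical names; the only detail taken from the request is that each node carries a `node_config` dict that should receive the shared parameters.

```python
class NodeSketch:
    """Hypothetical stand-in for a graph node holding a node_config dict."""

    def __init__(self, name, node_config=None):
        self.name = name
        self.node_config = dict(node_config or {})


def set_common_params(nodes, params, overwrite=False):
    """Copy shared parameters into the node_config of every node.

    Values already set on a node are kept unless overwrite is True.
    """
    for node in nodes:
        for key, value in params.items():
            if overwrite or key not in node.node_config:
                node.node_config[key] = value


nodes = [NodeSketch("fetch"), NodeSketch("rag", {"verbose": False})]
set_common_params(nodes, {"verbose": True, "headless": True})
print([node.node_config for node in nodes])
```

Keeping per-node values by default lets a graph set shared defaults once while individual nodes can still opt out.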

Add verbosity flag to remove print statements

Is your feature request related to a problem? Please describe.
The print statements make the output too verbose. It would be useful to be able to disable them with an optional flag.

Describe the solution you'd like
Add an optional flag to the node classes to suppress the print statements.

Playwright Implementation

Is your feature request related to a problem? Please describe.
JavaScript-based websites often fail to be fetched.

Describe the solution you'd like
Add an option inside the fetch_node class to fetch the HTML using Playwright.

Additional context
Try to make it headless.

ModuleNotFoundError: No module named 'fp'

Describe the bug

I ran https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_openai.py and got the error: ModuleNotFoundError: No module named 'fp'

To Reproduce
I ran the following example and got the error below.

"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

D:\Application\Conda\envs\agentkit\python.exe D:\Projects\githubhub\agentkit\tmp.py 
Traceback (most recent call last):
  File "D:\Projects\githubhub\agentkit\tmp.py", line 7, in <module>
    from scrapegraphai.graphs import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\__init__.py", line 5, in <module>
    from .smart_scraper_graph import SmartScraperGraph
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 5, in <module>
    from ..nodes import (
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\__init__.py", line 5, in <module>
    from .fetch_node import FetchNode
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\nodes\fetch_node.py", line 9, in <module>
    from ..utils.remover import remover
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\__init__.py", line 8, in <module>
    from .proxy_rotation import proxy_generator
  File "D:\Application\Conda\envs\agentkit\lib\site-packages\scrapegraphai\utils\proxy_rotation.py", line 4, in <module>
    from fp.fp import FreeProxy
ModuleNotFoundError: No module named 'fp'
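The traceback shows that importing the library fails because the top-level module `fp` is not installed. To the best of our knowledge that module is provided by the `free-proxy` package on PyPI, but treat this as an assumption and check the project's requirements. The sketch below, with the hypothetical helper `missing_modules`, shows a quick way to check which top-level modules are importable before running an example.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "fp" is the module named in the traceback; "json" is always available.
missing = missing_modules(["json", "fp"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules found")
```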

nomic-embed-text KeyError: 'Model not supported'

Describe the bug
I am trying to run the example from the readme.md file using ollama but the following error is raised:

    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'
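The failing line looks up the model name in a nested `models_tokens` dictionary and raises when the name is absent. One way to make such a lookup tolerant of unknown models is to fall back to a default, as in the sketch below. The dictionary contents, the `DEFAULT_TOKENS` value and the `lookup_model_tokens` helper are illustrative placeholders, not the library's real table or fix.

```python
# Placeholder table: the real models_tokens mapping lives in the library.
models_tokens = {
    "ollama": {
        "llama3": 8192,  # illustrative value only
    }
}

DEFAULT_TOKENS = 8192  # assumed fallback for models missing from the table


def lookup_model_tokens(provider, model, default=DEFAULT_TOKENS):
    """Return the token limit for provider/model, or `default` when unknown."""
    return models_tokens.get(provider, {}).get(model, default)


print(lookup_model_tokens("ollama", "nomic-embed-text"))
```

A fallback keeps newly released models usable, at the cost of possibly using a token limit that does not match the model.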

To Reproduce
Steps to reproduce the behavior:

  1. Install Ollama and Scrapegraph-ai
  2. Create a script with the following code:
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
  3. See error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:230, in AbstractGraph._create_embedder(self, embedder_config)
    229 try:
--> 230     models_tokens["ollama"][embedder_config["model"]]
    231 except KeyError:

KeyError: 'nomic-embed-text'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[1], line 16
      1 from scrapegraphai.graphs import SmartScraperGraph
      3 graph_config = {
      4     "llm": {
      5         "model": "ollama/llama3",
   (...)
     13     }
     14 }
---> 16 smart_scraper_graph = SmartScraperGraph(
     17     prompt="List me all the articles",
     18     # also accepts a string with the already downloaded HTML code
     19     source="https://perinim.github.io/projects",
     20     config=graph_config
     21 )
     23 result = smart_scraper_graph.run()
     24 print(result)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/smart_scraper_graph.py:47, in SmartScraperGraph.__init__(self, prompt, source, config)
     46 def __init__(self, prompt: str, source: str, config: dict):
---> 47     super().__init__(prompt, config, source)
     49     self.input_key = "url" if source.startswith("http") else "local_dir"

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:51, in AbstractGraph.__init__(self, prompt, config, source)
     48 self.config = config
     49 self.llm_model = self._create_llm(config["llm"], chat=True)
     50 self.embedder_model = self._create_default_embedder(    
---> 51     ) if "embeddings" not in config else self._create_embedder(
     52     config["embeddings"])
     54 # Set common configuration parameters
     55 self.verbose = True if config is None else config.get("verbose", False)

File ~/venv/lib/python3.11/site-packages/scrapegraphai/graphs/abstract_graph.py:232, in AbstractGraph._create_embedder(self, embedder_config)
    230         models_tokens["ollama"][embedder_config["model"]]
    231     except KeyError:
--> 232         raise KeyError("Model not supported")
    233     return OllamaEmbeddings(**embedder_config)
    235 elif "hugging_face" in embedder_config["model"]:

KeyError: 'Model not supported'

Expected behavior
I wanted to run the example

Desktop (please complete the following information):

  • OS: MacOS 14.31.
  • Browser: Chrome

Additional context
Ollama with llama3 and nomic-embed-text are working ok outside this project.
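
The traceback above boils down to a dictionary lookup that fails because the embedding model is missing from a table of known models. A minimal sketch of that pattern follows; `MODELS_TOKENS` and `check_embedder_model` are hypothetical stand-ins (not the library's code), and the table contents are illustrative only.

```python
# Hypothetical stand-in for the library's models_tokens table (illustrative values).
MODELS_TOKENS = {"ollama": {"llama3": 8192, "mistral": 8192}}


def check_embedder_model(embedder_config: dict) -> str:
    """Mimic the failing lookup and return the bare model name if it is known."""
    # Drop the provider prefix: "ollama/nomic-embed-text" -> "nomic-embed-text".
    model_name = embedder_config["model"].split("/", 1)[-1]
    if model_name not in MODELS_TOKENS["ollama"]:
        # Naming the missing model makes the error easier to act on.
        raise KeyError(f"Model not supported: {model_name!r}")
    return model_name


print(check_embedder_model({"model": "ollama/llama3"}))
```

Under this assumption, a chat model listed in the table passes, while `ollama/nomic-embed-text` fails until an entry for it is added.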

Azure Example

Hi there, love the idea of the package! I'm looking to use my Azure instance to run this and I can't seem to find any examples of how to do it in the docs and haven't been able to get something working.

This is the starting point of my Azure-based LLM usage. How can I go from this to using this package with one of the models I have hosted there?

Thanks in advance.


from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=my_endpoint,
    api_key=my_api_key,
    api_version=my_api_version
)

message_text = [{"role":"system","content":"You are an AI assistant that helps people find information."}]

completion = client.chat.completions.create(
    model="model_name",
    messages=message_text,
    # ... (remaining arguments elided)
)

response_str = completion.choices[0].message.content.strip()

Fix XML example

Describe the bug
The XML example parses its input as if it were HTML content. The problem is related to the remover() function inside the fetch_node class.

To Reproduce
Steps to reproduce the behavior:

  1. Run examples/openai/scrape_xml_openai.py
  2. Warning:

--- Executing Fetch Node ---
C:\Python\Python311\Lib\html\parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
k = self.parse_starttag(i)
--- Executing Parse Node ---

  3. Output:

{'books': [{'title': "XML Developer's Guide", 'author': 'Unknown', 'genre': 'Unknown'}]}

Expected behavior
Do not raise exceptions

Screenshots
No screenshots

Desktop (please complete the following information):

  • OS: Windows
  • Browser [e.g. chrome, safari]
  • Library Version: 0.4.1

Additional context
No additional context
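
The warning above recommends an XML parser for XML input. The sketch below illustrates that idea with the standard library's `xml.etree.ElementTree`, swapped in here for BeautifulSoup with `features="xml"` so the example has no dependencies. The `looks_like_xml` heuristic, the `extract_books` helper, and the sample document are assumptions made for illustration, not the library's code or the example's actual input file.

```python
import xml.etree.ElementTree as ET


def looks_like_xml(document: str) -> bool:
    """Heuristic (assumption): treat sources starting with an XML declaration as XML."""
    return document.lstrip().startswith("<?xml")


def extract_books(document: str) -> list:
    """Parse <book> elements with an XML parser, defaulting missing fields to 'Unknown'."""
    root = ET.fromstring(document)
    fields = ("title", "author", "genre")
    return [
        {field: book.findtext(field, default="Unknown") for field in fields}
        for book in root.iter("book")
    ]


# Hypothetical sample input.
sample = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
  </book>
</catalog>"""

if looks_like_xml(sample):
    print(extract_books(sample))
```

With a real XML parser the `author` and `genre` child elements of the sample are found, instead of falling back to 'Unknown'.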

Malicious data inside DOM

What happens if some malicious data is hidden (not visible to the regular human user) in the DOM?

For example:

Article 1

Article 2

Bomb!

//this is the hidden one

What is the output with a prompt like "Extract all the titles"? Is the DOM sanitized before AI processing?
Moreover, can the content influence how the AI "sees" the data in the DOM?
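
One possible form of sanitization is to drop subtrees that are hidden from human readers before the text reaches the model. The sketch below does this with the standard library's `html.parser`; it is an illustration under assumptions, not something the library is known to do. It only recognizes the `hidden` attribute and inline `display:none` / `visibility:hidden` styles, assumes well-formed markup with matching end tags, and the sample page is hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping subtrees hidden via `hidden` or inline CSS."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    @staticmethod
    def _is_hidden(attrs):
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        return ("hidden" in attr_map
                or "display:none" in style
                or "visibility:hidden" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden subtree, every nested element deepens it.
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> list:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: the third title is hidden with inline CSS.
page = '<h2>Article 1</h2><h2>Article 2</h2><h2 style="display: none">Bomb!</h2>'
print(visible_text(page))
```

Content hidden through external stylesheets or scripts would not be caught by this approach, and visible text can still contain instructions aimed at the model.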

Convert Graphviz Diagrams to Scraping Class Structures

Description:

Building on our graph visualization capabilities, we aim to develop a feature that allows the creation of scraping class structures from Graphviz diagrams. This will enable rapid prototyping of scraping workflows from visual designs.

Tasks:

  • Define a standard format or conventions for Graphviz diagrams representing scraping workflows;
  • Implement a parser that reads Graphviz diagram files and extracts node and edge information;
  • Develop logic to generate Python class definitions for nodes and graphs based on the parsed diagram.
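
The parsing and generation tasks above can be sketched roughly as follows. This is a toy illustration under assumptions: it only understands one `a -> b` edge per statement (no chained edges, attributes, or subgraphs), and the node names, `parse_dot_edges`, and `generate_graph_class` are hypothetical.

```python
import re

# Matches a single directed edge such as "fetch_node -> parse_node".
EDGE_PATTERN = re.compile(r"(\w+)\s*->\s*(\w+)")


def parse_dot_edges(dot_source: str) -> list:
    """Extract (source, target) pairs from simple DOT edge statements."""
    return EDGE_PATTERN.findall(dot_source)


def generate_graph_class(class_name: str, edges: list) -> str:
    """Emit Python source for a class listing the diagram's nodes and edges."""
    # dict.fromkeys de-duplicates node names while keeping first-seen order.
    nodes = list(dict.fromkeys(name for edge in edges for name in edge))
    return "\n".join([
        f"class {class_name}:",
        f"    nodes = {nodes!r}",
        f"    edges = {edges!r}",
    ])


# Hypothetical diagram describing a three-step scraping workflow.
dot = """
digraph scraper {
    fetch_node -> parse_node;
    parse_node -> generate_answer_node;
}
"""

source = generate_graph_class("ScraperGraph", parse_dot_edges(dot))
print(source)

# The generated text is valid Python and can be loaded into a namespace.
namespace = {}
exec(source, namespace)
print(namespace["ScraperGraph"].nodes)
```

A production parser would more likely rely on a dedicated DOT library than on a regular expression; the regex keeps the sketch dependency-free.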

Please add examples to the documentation, struggling with understanding how it works

Wonderful! I was looking for something like this library!

But I found it difficult to use the library because there are no examples.

For instance, I need to extract the name, the province and the address of all cinema theaters listed on this website:

PROVINCES
https://www.mymovies.it/cinema/

The information is divided into 3 levels: the page above contains the list of provinces, and each province links to the names of its theaters:

PROVINCES->THEATERS
https://www.mymovies.it/cinema/roma/

And each theatre name links to the address:

PROVINCES->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/5157/

Sometimes there is a 4th level for smaller towns inside the province area of the city:

PROVINCES->SMALL_TOWNS
https://www.mymovies.it/cinema/roma/fianoromano/

PROVINCES->SMALL_TOWNS->THEATRES->ADDRESS
https://www.mymovies.it/cinema/roma/fianoromano/5102/

How am I supposed to ask for that and get back the names and addresses of all Italian cinema theatres, divided by province?

Thanks in advance!

Add getState() function to Abstract Graph

Return the complete final state dictionary of the graph.
So, for example it can be used as:

final_state = smart_scraper.getState()

An additional argument could be added to directly specify which key you want from the state:

urls = smart_scraper.getState(key="urls")
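
A minimal sketch of how such an accessor might behave is shown below. `GraphWithState`, its `final_state` attribute, and the stubbed `run()` are hypothetical, and the method is spelled `get_state` here to follow Python naming conventions rather than the `getState` spelling used in the request.

```python
from typing import Any, Optional


class GraphWithState:
    """Hypothetical minimal graph that remembers its final state after run()."""

    def __init__(self) -> None:
        self.final_state: dict = {}

    def run(self) -> dict:
        # Stand-in for real execution: a real graph would fill the state node by node.
        self.final_state = {
            "urls": ["https://perinim.github.io/projects/"],
            "answer": {"projects": []},
        }
        return self.final_state["answer"]

    def get_state(self, key: Optional[str] = None) -> Any:
        """Return the whole final state, or only the entry stored under `key`."""
        if key is None:
            return self.final_state
        return self.final_state[key]


graph = GraphWithState()
graph.run()
print(graph.get_state(key="urls"))
```

In this sketch, asking for a key that is not in the state raises `KeyError`; returning `None` instead would be an equally reasonable design choice.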

The job post does not contain any body content.

Describe the bug
I'm trying to scrape the content of a page, asking for a summary of the job post. The answer I get is the following:
"The job post does not contain any body content."

To Reproduce
Scrape for example the following page

Expected behavior
Getting a summary of the page content.

Create for, while and switch nodes

Create a new folder outside nodes, called conditional nodes, and add new nodes that choose a specific output for a given input:

  • for node: given an input, iterates over the given nodes and returns an output
  • switch/if node: given a state and a condition, chooses a branch and returns the name (string) of the next node
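
One way to read the two node types above is sketched here. The `SwitchNode` and `ForNode` classes, their constructor arguments, and the node names are all hypothetical, and `ForNode` interprets "iterates" as applying one callable to every item stored in the state, which is only one possible reading of the request.

```python
from typing import Callable, List, Tuple


class SwitchNode:
    """Return the name of the next node for the first condition the state satisfies."""

    def __init__(self, cases: List[Tuple[Callable[[dict], bool], str]], default: str) -> None:
        self.cases = cases
        self.default = default

    def execute(self, state: dict) -> str:
        for condition, next_node in self.cases:
            if condition(state):
                return next_node
        return self.default


class ForNode:
    """Apply `body` to every item under `input_key` and store the results under `output_key`."""

    def __init__(self, input_key: str, output_key: str, body: Callable) -> None:
        self.input_key = input_key
        self.output_key = output_key
        self.body = body

    def execute(self, state: dict) -> dict:
        state[self.output_key] = [self.body(item) for item in state[self.input_key]]
        return state


# Route XML sources to a different (hypothetical) parse node than HTML sources.
switch = SwitchNode(
    cases=[(lambda s: s["source"].endswith(".xml"), "xml_parse_node")],
    default="html_parse_node",
)
print(switch.execute({"source": "books.xml"}))

loop = ForNode("urls", "lengths", body=len)
print(loop.execute({"urls": ["a", "bcd"]}))
```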

Allow using a remote Ollama

People don't always run apps on the same computer where they run Ollama.
Since you provide a docker-compose.yml to run ollama, you might run your app in another container.

Allow people to set base_url so that ChatOllama can connect to a URL other than http://localhost:11434.
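
The README's configuration already carries a `base_url` key, so, assuming that value is forwarded to the Ollama client, a remote instance could be selected as sketched below. `OLLAMA_BASE_URL` is a hypothetical environment variable chosen for this example, not one the library is known to read on its own.

```python
import os

# Hypothetical variable name; falls back to the local default when unset.
ollama_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": ollama_url,
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": ollama_url,
    },
}

print(graph_config["llm"]["base_url"])
```

Inside a docker-compose network, the variable would typically point at the service name of the Ollama container instead of localhost.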

Groq model implementation

Is your feature request related to a problem? Please describe.
Groq is not supported

Describe the solution you'd like
Implement this model in the list of available models for Scrapegraph

Describe alternatives you've considered
none

Additional context
Issue from discord

Fix releases in github

To make the current release clear to users, I suggest implementing a GitHub Action for releases.

Basically move from that:

[Screenshot 2024-03-12 at 11 23 43]

To that:

[image]

Adding debugging mode

Is your feature request related to a problem? Please describe.
I am having a challenge finding out where my project is going wrong. When I run a prompt I receive {} as the output and don't get any further information as to why I got that as the result.

Describe the solution you'd like
Some sort of debugging mode with higher level of detail around each stage of the pipeline and what succeeded and failed to get down to the root of the issue.
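
A rough sketch of what such a mode could look like, built on the standard `logging` module, follows. The `run_pipeline` helper, the logger name, and the stub stages are hypothetical and only illustrate per-stage reporting; they do not reflect how the library executes its graphs.

```python
import logging

logger = logging.getLogger("scrape_pipeline")  # hypothetical logger name


def run_pipeline(stages, state, debug=False):
    """Run (name, function) stages in order and report what each one produced."""
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
    for name, stage in stages:
        try:
            state = stage(state)
        except Exception:
            # Log the failing stage with its traceback, then re-raise.
            logger.exception("stage %r failed", name)
            raise
        logger.debug("stage %r finished; state keys: %s", name, sorted(state))
        if "answer" in state and not state["answer"]:
            logger.warning("stage %r produced an empty answer", name)
    return state


logging.basicConfig(format="%(levelname)s: %(message)s")

# Stub stages standing in for real fetch and answer-generation steps.
stages = [
    ("fetch", lambda s: {**s, "document": "<html></html>"}),
    ("generate_answer", lambda s: {**s, "answer": {}}),
]
final_state = run_pipeline(stages, {"prompt": "List me all the articles"}, debug=True)
```

With `debug=True`, each stage reports the keys it left in the state, and an empty answer such as `{}` triggers a warning that names the stage that produced it.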
