
data's Issues

[Data ingestion] add further articles

Because some articles are hosted on journals or websites that aren't backed by Discourse, Mirror, Medium, and so forth, they may be referenced in the CSV aggregate without their PDF counterpart. They need to be added manually.

[Data ingestion] add website referencing

Similar to indexing by authors, the idea is to add a short description for each website so that the end user can ask questions like "Return all the VCs investing in MEV" or "What websites write content on shared sequencers?"
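As a rough sketch of what that could look like, the snippet below assumes a hypothetical description column added to data/links/websites.csv; the column names and schema are assumptions, not the current format.

```python
# Hypothetical sketch: assumes data/links/websites.csv gains a "description" column.
# The "website" and "description" column names are assumptions, not the current schema.
import csv


def find_websites_by_topic(csv_path: str, keyword: str) -> list[str]:
    """Return websites whose description mentions the given keyword."""
    matches = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if keyword.lower() in (row.get("description") or "").lower():
                matches.append(row.get("website", ""))
    return matches


# Example query backing "What websites write content on shared sequencers?":
# find_websites_by_topic("data/links/websites.csv", "shared sequencer")
```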

[Data ingestion] add further research papers

Because some research papers are hosted on journals or websites other than SSRN, arXiv, and so forth, they may be referenced in the CSV aggregate without their PDF counterpart.

[feature] Crawl all non-medium websites to fetch all articles

TODO

  • Update src/populate_csv_files/get_article_content/crawl_non_medium_websites.py to crawl all posts (URLs) from every website listed in data/links/websites.csv (the websites can be visualized on the Websites tab of data.mev.fyi).
  • Input: website URLs. Output: a dict mapping each website to all of that website's article URLs (handling pagination).
  • Approach: a single general script to which you pass config items for each website.
  • Work in progress:
    • Fix pagination
    • Make sure it works for all websites; the config skeleton might need to be updated
    • If a website has no articles, first go to the website itself and check whether other index URLs are available (e.g. /technology or /writing [...])
    • If there are new websites and the config already exists, add empty config items for them to the existing config file
    • If new index pages become available, e.g. a /technology when we only added /writing, then append this /technology to to_parse.csv

Challenges:

  • Make sure pagination works
  • Make the code general and robust. Abstract all the complexity into the config items; we can expect several containers, each with its own selectors, for each site (a minimal config-driven sketch follows this list)
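A minimal sketch of the general, config-driven approach described above, assuming a hypothetical config shape with container, link, and next-page selectors per website; the actual keys expected by crawl_non_medium_websites.py may differ.

```python
# Minimal sketch of a config-driven crawler; the config keys below are assumptions.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SITE_CONFIGS = {
    # Hypothetical config entry: one container selector, one link selector,
    # and a selector for the "next page" control to handle pagination.
    "https://example-blog.xyz/writing": {
        "container_selector": "div.post-list",
        "link_selector": "a.post-title",
        "next_page_selector": "a.pagination-next",
    },
}


def crawl_site(base_url: str, config: dict, max_pages: int = 50) -> list[str]:
    """Collect article URLs for one site, following pagination links."""
    article_urls, url, pages = [], base_url, 0
    while url and pages < max_pages:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for container in soup.select(config["container_selector"]):
            for link in container.select(config["link_selector"]):
                if link.get("href"):
                    article_urls.append(urljoin(url, link["href"]))
        next_link = soup.select_one(config["next_page_selector"])
        url = urljoin(url, next_link["href"]) if next_link and next_link.get("href") else None
        pages += 1
    return article_urls


def crawl_all(configs: dict) -> dict[str, list[str]]:
    """Return the expected output shape: {website: [article URLs]}."""
    return {site: crawl_site(site, cfg) for site, cfg in configs.items()}
```

The per-site complexity (several containers, odd pagination controls) would live entirely in the config entries, so the crawl loop itself stays generic.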

End goal:

  • Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, we scrape all articles and add them to the database.
  • Expected cost: 2-3 hours to reach >50% of websites covered. Challenge: the config format may need numerous updates.

FAQ

Task: Obtain a list of all article URLs for each website

  • Classes are not important as long as the file works
  • Input: called from the CLI with no arguments
  • Output: a dict mapping each website to the list of all its article links
  • How does the code know what to crawl? Via the config file generated from websites.csv; all that remains is to update the selectors for each website (a config-skeleton sketch follows this list)
  • The articles on Medium should NOT be crawled because valmeylan is working on them
  • How do I know which websites should NOT be crawled because they only have one article?
  • Modify an existing file created recently by valmeylan
  • Add logging to verify that pagination works.
  • Continue this chat https://chat.openai.com/share/0f46d34f-156f-417a-ab1d-6924ac6462a2
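As a hedged illustration of the "config file generated from websites.csv" answer and the "empty config items for new websites" bullet above: the file names, key names, and logging setup below are assumptions, not the repository's actual format.

```python
# Sketch: merge new websites from websites.csv into an existing per-site config file,
# adding empty skeletons whose selectors then get filled in by hand.
import csv
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler-config")

EMPTY_SKELETON = {"container_selector": "", "link_selector": "", "next_page_selector": ""}


def merge_config(websites_csv: str, config_path: str) -> dict:
    """Add an empty config entry for any website missing from the config file."""
    config = json.loads(Path(config_path).read_text()) if Path(config_path).exists() else {}
    with open(websites_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            site = row["website"]  # assumed column name
            if site not in config:
                log.info("Adding empty config skeleton for %s", site)
                config[site] = dict(EMPTY_SKELETON)
    Path(config_path).write_text(json.dumps(config, indent=2))
    return config
```

The same logger could be reused inside the crawl loop to log each fetched page, which is one way to verify that pagination works.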

[feature] extract all unique blog websites from articles

TODO

Example:

https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029 -> https://ethresear.ch/t/
https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ -> https://taiko.mirror.xyz/
https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff -> https://figmentcapital.medium.com/

Helper:

The regexp hashmap url_patterns, which identifies whether a link points directly to an article or to its website (e.g. the author's blog), is available in https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/parse_new_data.py

Challenge:

There can be several matches; for example, some Medium authors' blogs use the format <author>.medium.com/<article> while others use www.medium.com/<author>
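For illustration only, a simplified version of what a url_patterns-style mapping and lookup could look like; the actual dict in parse_new_data.py differs, and the two Medium patterns below only hint at handling both formats from the challenge above.

```python
# Simplified sketch of a url_patterns-style mapping; patterns are illustrative assumptions.
import re

url_patterns = {
    # <author>.medium.com/<article> vs (www.)medium.com/<author>/<article>
    "medium_subdomain": r"^(https?://(?!www\.)[a-z0-9-]+\.medium\.com/)",
    "medium_path": r"^(https?://(?:www\.)?medium\.com/@?[a-z0-9_-]+/)",
    "mirror": r"^(https?://[a-z0-9-]+\.mirror\.xyz/)",
    "ethresearch": r"^(https?://ethresear\.ch/t/)",
}


def extract_website(article_url: str) -> str | None:
    """Map an article URL to its website/blog root using the first matching pattern."""
    for pattern in url_patterns.values():
        match = re.match(pattern, article_url, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


# extract_website("https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff")
# -> "https://figmentcapital.medium.com/"
```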

End goal:

Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, we scrape all articles and add them to the database.

[feature] Create a React front-end for data.mev.fyi Google Sheet

Currently the data.mev.fyi Google Sheet showing all data aggregates (articles, research papers, and so forth) isn't ideal from a user-experience perspective: there is no user authentication, it does not remember filtering, and so forth. A nicer React frontend, e.g. a second page on mev.fyi, would drastically improve the experience of browsing the content database.

[Data ingestion] process twitter threads

Scrape Twitter threads into a format the backend can read, e.g. save them as PDF.
It would be great to explicitly indicate whether a tweet or thread is an answer to another tweet or thread, in which case the answered tweet should be added as the first post in the chronological order.
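A hypothetical sketch of that ordering rule, assuming a simple in-house Tweet record; the fields and the fetching step are assumptions, not an existing mev-fyi data structure or Twitter API wrapper.

```python
# Sketch only: order a fetched thread chronologically and, if the thread answers
# another tweet, prepend that tweet so it opens the rendered (e.g. PDF) output.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tweet:
    tweet_id: str
    text: str
    created_at: datetime
    in_reply_to: str | None = None  # id of the tweet/thread this answers, if any


def flatten_thread(thread: list[Tweet], answered: Tweet | None) -> list[Tweet]:
    """Chronological order, with the answered tweet (if any) as the first post."""
    ordered = sorted(thread, key=lambda t: t.created_at)
    return ([answered] if answered is not None else []) + ordered
```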

[refactor] Articles processing

The current article processing contains copy-pasted methods, whereas we could simplify it with a config JSON plus handling for a few edge cases. Most of the time it is simply a matter of changing the CSS selector for the title, author, release date (plus normalizing it to yyyy-mm-dd), and content.

We can make something cleaner where each CSS selector and date format is a parameter.
Even better: automate the process with a scraping tool.
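A sketch of the "each CSS selector and date format is a parameter" idea; the domain, selectors, and date format in the config entry are illustrative assumptions.

```python
# Sketch of a config-driven article parser; the config shape is an assumption.
from datetime import datetime

import requests
from bs4 import BeautifulSoup

SITE_PARSER_CONFIG = {
    "example-blog.xyz": {
        "title": "h1.entry-title",
        "author": "span.author-name",
        "date": "time.published",
        "date_format": "%B %d, %Y",  # parsed, then re-emitted as yyyy-mm-dd
        "content": "div.entry-content",
    },
}


def parse_article(url: str, cfg: dict) -> dict:
    """Extract title, author, normalized release date, and content via config selectors."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def text(selector: str) -> str:
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    raw_date = text(cfg["date"])
    try:
        release_date = datetime.strptime(raw_date, cfg["date_format"]).strftime("%Y-%m-%d")
    except ValueError:
        release_date = ""  # edge cases still need per-site handling
    return {
        "title": text(cfg["title"]),
        "author": text(cfg["author"]),
        "release_date": release_date,
        "content": text(cfg["content"]),
    }
```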

[feature] automatically map and scrap the content of any docs

So far we have only parsed content on a per-URL basis. An improvement would be to parse protocols' documentation by starting from the top-most documentation page (e.g. the docs landing page), collecting the tree of all URLs in the docs, then scraping each page and adding it to the database.
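A minimal sketch of that flow, assuming the docs site is a plain HTML tree reachable from its landing page; JS-rendered navigation or versioned docs would need more care, and the scraping/ingestion step is left out.

```python
# Sketch: map all same-domain pages reachable from a docs landing page.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def map_docs(landing_url: str, max_pages: int = 200) -> set[str]:
    """Breadth-first walk of same-domain pages reachable from the docs landing page."""
    domain = urlparse(landing_url).netloc
    seen, queue = set(), [landing_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for link in soup.select("a[href]"):
            target = urljoin(url, link["href"]).split("#")[0]
            if urlparse(target).netloc == domain and target not in seen:
                queue.append(target)
    return seen


# Each URL in map_docs("https://docs.example-protocol.xyz/") would then be scraped
# and added to the database, reusing the existing per-URL pipeline.
```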
