
data's Issues

[Data ingestion] add further articles

Because some articles are hosted on journals or websites that aren't backed by Discourse, Mirror, Medium, and so forth, they may be referenced in the CSV aggregate without their PDF counterpart. They need to be added manually.

[Data ingestion] add website referencing

Similar to indexing by authors, the idea is to add a short description for each website so that the end user can ask questions like "Return all the VCs investing in MEV" or "What websites write content on shared sequencers?"
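As a rough sketch of what that could look like, the snippet below assumes a hypothetical description column added to data/links/websites.csv; the column names and schema are assumptions, not the current format.

```python
# Hypothetical sketch: assumes data/links/websites.csv gains a "description" column.
# The "website" and "description" column names are assumptions, not the current schema.
import csv


def find_websites_by_topic(csv_path: str, keyword: str) -> list[str]:
    """Return websites whose description mentions the given keyword."""
    matches = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if keyword.lower() in (row.get("description") or "").lower():
                matches.append(row.get("website", ""))
    return matches


# Example query backing "What websites write content on shared sequencers?":
# find_websites_by_topic("data/links/websites.csv", "shared sequencer")
```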

[Data ingestion] add further research papers

Because some research papers are hosted on journals or websites other than SSRN, arXiv, and so forth, they may be referenced in the CSV aggregate without their PDF counterpart.

[feature] Crawl all non-medium websites to fetch all articles

TODO

  • Update src/populate_csv_files/get_article_content/crawl_non_medium_websites.py to crawl all posts (URLs) from every website listed in data/links/websites.csv (the websites can be visualized on the Websites tab of data.mev.fyi).
  • Input: website URLs. Output: a dict mapping each website to all of that website's article URLs (handling pagination).
  • Approach: a single general script to which you pass config items for each website.
  • Work in progress:
    • Fix pagination
    • Make sure it works for all websites; the config skeleton might need to be updated
    • If a website has no articles, first go to the website itself and check whether other index URLs are available (e.g. /technology or /writing [...])
    • If there are new websites and the config already exists, add empty config items for them to the existing config file
    • If new index pages become available, e.g. a /technology when we only added /writing, then append this /technology to to_parse.csv

Challenges:

  • Make sure pagination works
  • Make the code general and robust. Abstract all the complexity into the config items; we can expect several containers, each with its own selectors, for each site (a minimal config-driven sketch follows this list)
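A minimal sketch of the general, config-driven approach described above, assuming a hypothetical config shape with container, link, and next-page selectors per website; the actual keys expected by crawl_non_medium_websites.py may differ.

```python
# Minimal sketch of a config-driven crawler; the config keys below are assumptions.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SITE_CONFIGS = {
    # Hypothetical config entry: one container selector, one link selector,
    # and a selector for the "next page" control to handle pagination.
    "https://example-blog.xyz/writing": {
        "container_selector": "div.post-list",
        "link_selector": "a.post-title",
        "next_page_selector": "a.pagination-next",
    },
}


def crawl_site(base_url: str, config: dict, max_pages: int = 50) -> list[str]:
    """Collect article URLs for one site, following pagination links."""
    article_urls, url, pages = [], base_url, 0
    while url and pages < max_pages:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for container in soup.select(config["container_selector"]):
            for link in container.select(config["link_selector"]):
                if link.get("href"):
                    article_urls.append(urljoin(url, link["href"]))
        next_link = soup.select_one(config["next_page_selector"])
        url = urljoin(url, next_link["href"]) if next_link and next_link.get("href") else None
        pages += 1
    return article_urls


def crawl_all(configs: dict) -> dict[str, list[str]]:
    """Return the expected output shape: {website: [article URLs]}."""
    return {site: crawl_site(site, cfg) for site, cfg in configs.items()}
```

The per-site complexity (several containers, odd pagination controls) would live entirely in the config entries, so the crawl loop itself stays generic.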

End goal:

  • Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, we scrape all articles and add them to the database.
  • Expected cost: 2-3 hours to reach >50% of websites covered. Challenge: the config format may need numerous updates.

FAQ

Task: Obtain a list of all article URLs for each website

  • Classes are not important as long as the file works
  • Input: called from the CLI with no arguments
  • Output: a dict mapping each website to the list of all its article links
  • How does the code know what to crawl? Via the config file generated from websites.csv; all that remains is to update the selectors for each website (a config-skeleton sketch follows this list)
  • The articles on Medium should NOT be crawled because valmeylan is working on them
  • How do I know which websites should NOT be crawled because they only have one article?
  • Modify an existing file created recently by valmeylan
  • Add logging to verify that pagination works.
  • Continue this chat https://chat.openai.com/share/0f46d34f-156f-417a-ab1d-6924ac6462a2
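As a hedged illustration of the "config file generated from websites.csv" answer and the "empty config items for new websites" bullet above: the file names, key names, and logging setup below are assumptions, not the repository's actual format.

```python
# Sketch: merge new websites from websites.csv into an existing per-site config file,
# adding empty skeletons whose selectors then get filled in by hand.
import csv
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler-config")

EMPTY_SKELETON = {"container_selector": "", "link_selector": "", "next_page_selector": ""}


def merge_config(websites_csv: str, config_path: str) -> dict:
    """Add an empty config entry for any website missing from the config file."""
    config = json.loads(Path(config_path).read_text()) if Path(config_path).exists() else {}
    with open(websites_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            site = row["website"]  # assumed column name
            if site not in config:
                log.info("Adding empty config skeleton for %s", site)
                config[site] = dict(EMPTY_SKELETON)
    Path(config_path).write_text(json.dumps(config, indent=2))
    return config
```

The same logger could be reused inside the crawl loop to log each fetched page, which is one way to verify that pagination works.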

[feature] extract all unique blog websites from articles

TODO

Example:

https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029 -> https://ethresear.ch/t/
https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ -> https://taiko.mirror.xyz/
https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff -> https://figmentcapital.medium.com/

Helper:

The regexp hashmap url_patterns, which identifies whether a link points directly to an article or to its website (e.g. the author's blog), is available in https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/parse_new_data.py

Challenge:

There can be several matches; for example, some Medium authors' blogs use the format <author>.medium.com/<article> while others use www.medium.com/<author>
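For illustration only, a simplified version of what a url_patterns-style mapping and lookup could look like; the actual dict in parse_new_data.py differs, and the two Medium patterns below only hint at handling both formats from the challenge above.

```python
# Simplified sketch of a url_patterns-style mapping; patterns are illustrative assumptions.
import re

url_patterns = {
    # <author>.medium.com/<article> vs (www.)medium.com/<author>/<article>
    "medium_subdomain": r"^(https?://(?!www\.)[a-z0-9-]+\.medium\.com/)",
    "medium_path": r"^(https?://(?:www\.)?medium\.com/@?[a-z0-9_-]+/)",
    "mirror": r"^(https?://[a-z0-9-]+\.mirror\.xyz/)",
    "ethresearch": r"^(https?://ethresear\.ch/t/)",
}


def extract_website(article_url: str) -> str | None:
    """Map an article URL to its website/blog root using the first matching pattern."""
    for pattern in url_patterns.values():
        match = re.match(pattern, article_url, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


# extract_website("https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff")
# -> "https://figmentcapital.medium.com/"
```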

End goal:

Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, we scrape all articles and add them to the database.

[feature] Create a React front-end for data.mev.fyi Google Sheet

Currently the data.mev.fyi Google Sheet showing all data aggregates (articles, research papers, and so forth) isn't ideal from a user-experience perspective: there is no user authentication, it does not remember filtering, and so forth. A nicer React frontend, e.g. a second page on mev.fyi, would drastically improve the experience of browsing the content database.

[Data ingestion] process twitter threads

Scrape Twitter threads into a format the backend can read, e.g. save them as PDF.
It would be great to explicitly indicate whether a tweet or thread is an answer to another tweet or thread, in which case the answered tweet should be added as the first post in the chronological order.
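A hypothetical sketch of that ordering rule, assuming a simple in-house Tweet record; the fields and the fetching step are assumptions, not an existing mev-fyi data structure or Twitter API wrapper.

```python
# Sketch only: order a fetched thread chronologically and, if the thread answers
# another tweet, prepend that tweet so it opens the rendered (e.g. PDF) output.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tweet:
    tweet_id: str
    text: str
    created_at: datetime
    in_reply_to: str | None = None  # id of the tweet/thread this answers, if any


def flatten_thread(thread: list[Tweet], answered: Tweet | None) -> list[Tweet]:
    """Chronological order, with the answered tweet (if any) as the first post."""
    ordered = sorted(thread, key=lambda t: t.created_at)
    return ([answered] if answered is not None else []) + ordered
```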

[refactor] Articles processing

The current article processing contains copy-pasted methods, whereas we could simplify it with a config JSON plus handling for a few edge cases. Most of the time it is simply a matter of changing the CSS selector for the title, author, release date (plus normalizing it to yyyy-mm-dd), and content.

We can make something cleaner where each CSS selector and date format is a parameter.
Even better: automate the process with a scraping tool.
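A sketch of the "each CSS selector and date format is a parameter" idea; the domain, selectors, and date format in the config entry are illustrative assumptions.

```python
# Sketch of a config-driven article parser; the config shape is an assumption.
from datetime import datetime

import requests
from bs4 import BeautifulSoup

SITE_PARSER_CONFIG = {
    "example-blog.xyz": {
        "title": "h1.entry-title",
        "author": "span.author-name",
        "date": "time.published",
        "date_format": "%B %d, %Y",  # parsed, then re-emitted as yyyy-mm-dd
        "content": "div.entry-content",
    },
}


def parse_article(url: str, cfg: dict) -> dict:
    """Extract title, author, normalized release date, and content via config selectors."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def text(selector: str) -> str:
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    raw_date = text(cfg["date"])
    try:
        release_date = datetime.strptime(raw_date, cfg["date_format"]).strftime("%Y-%m-%d")
    except ValueError:
        release_date = ""  # edge cases still need per-site handling
    return {
        "title": text(cfg["title"]),
        "author": text(cfg["author"]),
        "release_date": release_date,
        "content": text(cfg["content"]),
    }
```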

[feature] automatically map and scrap the content of any docs

So far we have only parsed content on a per-URL basis. An improvement would be to parse protocols' documentation by starting from the top-most documentation page (e.g. the docs landing page), collecting the tree of all URLs in the docs, then scraping each page and adding it to the database.
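A minimal sketch of that flow, assuming the docs site is a plain HTML tree reachable from its landing page; JS-rendered navigation or versioned docs would need more care, and the scraping/ingestion step is left out.

```python
# Sketch: map all same-domain pages reachable from a docs landing page.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def map_docs(landing_url: str, max_pages: int = 200) -> set[str]:
    """Breadth-first walk of same-domain pages reachable from the docs landing page."""
    domain = urlparse(landing_url).netloc
    seen, queue = set(), [landing_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for link in soup.select("a[href]"):
            target = urljoin(url, link["href"]).split("#")[0]
            if urlparse(target).netloc == domain and target not in seen:
                queue.append(target)
    return seen


# Each URL in map_docs("https://docs.example-protocol.xyz/") would then be scraped
# and added to the database, reusing the existing per-URL pipeline.
```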
