The open-source repository mev.fyi aggregates research on Maximal Extractable Value (MEV). Explore curated academic papers, community contributions, and educational content on MEV and related topics.
Right now we manually add articles from websites. We can add functionality to crawl all referenced websites and fetch the latest articles without adding them by hand.
Conversely, we can derive the 'website' entries, i.e. obtain the base domain from each handpicked article.
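A minimal sketch of that derivation, assuming the handpicked article URLs are already available as a plain list (the example URLs are illustrative):

```python
from urllib.parse import urlparse

def base_domain(article_url: str) -> str:
    """Return the scheme + host of an article URL, i.e. the 'website' it belongs to."""
    parsed = urlparse(article_url)
    return f"{parsed.scheme}://{parsed.netloc}"

# Derive the set of websites from a list of handpicked article URLs.
articles = [
    "https://writings.flashbots.net/some-article",
    "https://collective.flashbots.net/t/some-thread/123",
]
websites = sorted({base_domain(url) for url in articles})
```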
We can implement the same curation system as for YouTube video processing
Because some articles are hosted on specific journals or websites whose backend isn't Discourse, Mirror, Medium, and so forth, they might be referenced in the CSV aggregate but without their PDF counterpart. They need to be added manually.
The idea is to be able to answer questions like "Return the sources of ...", "Who are the authors writing about ...", or "Who works at RIG?".
Essentially it requires creating a new table by parsing the authors from each existing CSV aggregate
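A sketch of that parsing step, assuming each CSV aggregate has a comma-separated 'authors' column plus 'title' and 'link' columns (these column names are assumptions, not the current schema):

```python
import csv

def build_authors_table(aggregate_csv: str) -> list[dict]:
    """Explode the comma-separated 'authors' column of a CSV aggregate into one row per author."""
    rows = []
    with open(aggregate_csv, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            for author in record.get("authors", "").split(","):
                author = author.strip()
                if author:
                    rows.append({
                        "author": author,
                        "title": record.get("title", ""),
                        "link": record.get("link", ""),
                    })
    return rows
```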
The YouTube data ingestion + indexing is big. It could be worthwhile to spin it off as a micro-service / package to make it more maintainable, let people fork it to build on top of it, and more.
Right now, every manual fetch of articles via get_articles_content overwrites the previous content with the latest fetched version, which discards past edits.
We would need to add safeguards to make sure that the scraping process does not lose a significant chunk of the content in the event of a malfunction.
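One possible safeguard (a sketch, not the current implementation): refuse to overwrite an article when the freshly scraped content is drastically shorter than what is already stored, which is a common symptom of a broken selector or a blocked request.

```python
def safe_update(existing: str, fetched: str, min_ratio: float = 0.5) -> str:
    """Keep the existing content if the new fetch lost too much of it (likely a scraping malfunction)."""
    if existing and len(fetched) < min_ratio * len(existing):
        # The fetch shrank by more than half: keep the old content and flag it for manual review.
        return existing
    return fetched
```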
Similarly to indexing by authors, the idea is to have a small description for each website such that the end-user can ask questions like "Return all the VCs investing in MEV" or "What websites write content on shared sequencers"
Because some research papers are hosted on specific journals or websites which aren't SSRN, arXiv, and so forth, they might be referenced in the CSV aggregate but without their PDF counterpart
Input: website URLs. Output: dict mapping each website to all of that website's article URLs (handling pagination)
Approach: have a general script to which you pass config items for each website.
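A sketch of what one per-website config item could look like; the field names and selectors are assumptions, not the current format:

```python
# Hypothetical config item consumed by the general crawler script.
WEBSITE_CONFIGS = [
    {
        "base_url": "https://example-blog.com/writing",
        "article_selector": "div.post-list a.post-title",  # CSS selector for article links
        "next_page_selector": "a.pagination-next",          # CSS selector for the pagination link
        "max_pages": 50,                                     # hard stop to avoid infinite pagination loops
    },
]
```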
Work in progress:
Fix pagination
Make sure it works for all websites; the config skeleton might need to be updated
If there are no articles on the website, first go to said website and check whether other index URLs are available (e.g. /technology or /writing [...])
If there are new websites and the config already exists, add the empty config items to the existing config file
If there are new website indexes available, e.g. a /technology while we had only added the /writing, then append this /technology to to_parse.csv (see the sketch below)
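A sketch of that discovery step, assuming to_parse.csv holds one URL per line and probing a small set of common index paths (the candidate paths below are illustrative):

```python
import csv

import requests

CANDIDATE_PATHS = ["/writing", "/technology", "/blog", "/research"]

def discover_indexes(base_url: str, to_parse_csv: str = "to_parse.csv") -> None:
    """Probe common index paths on a website and append the ones that respond to to_parse.csv."""
    found = []
    for path in CANDIDATE_PATHS:
        url = base_url.rstrip("/") + path
        try:
            if requests.get(url, timeout=10).status_code == 200:
                found.append(url)
        except requests.RequestException:
            continue
    with open(to_parse_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in found:
            writer.writerow([url])
```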
Challenges:
Make sure it works for pagination
Make code general and robust. Abstract all the complexity into the config items. We can expect several containers, each with their own selectors, for each site
End goal:
get all the unique author blog posts. Then we crawl all their websites. Then, once all unique article URLs are indexed, we scrape all articles and add them to the database.
Expected cost: 2-3 hours to reach >50% of websites covered. Challenges: possibly numerous updates to the config format.
FAQ
Task: obtain a list of all article URLs for each website
classes are not important as long as the file works
input: called via CLI with no arguments
output: dict mapping each website to a list of all its article links
how does the code obtain the list of articles to crawl? -> via the config file generated from websites.csv; all that is needed now is to update the selectors for each website (see the sketch after this FAQ)
the articles on Medium should NOT be crawled because valmeylan is working on it
how do I know which websites should NOT be crawled because they only have one article?
-> modify an existing file created recently by valmeylan
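A sketch of the skeleton generation mentioned above, assuming websites.csv has a 'url' column and that the config file is JSON (both are assumptions about the current layout):

```python
import csv
import json
import os

def update_config_skeleton(websites_csv: str = "websites.csv",
                           config_path: str = "crawler_config.json") -> None:
    """Add an empty config item for every website not yet present in the config file."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
    with open(websites_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = row["url"].strip()
            # Existing entries keep their selectors; new websites get an empty skeleton to fill in.
            config.setdefault(url, {"article_selector": "", "next_page_selector": "", "max_pages": 1})
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```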
There are a lot of awesome GitHub repos we can index/scrape and add to the database.
Surely we can also automatically generate an "awesome of awesome" list of repos for MEV, DeFi, and security, with a focus on each protocol, and more, as a way to drive SEO back to mev.fyi
There can be several matches, e.g. some Medium authors' blog posts are in the format <author>.medium.com/<article> while others are in the format www.medium.com/<author>
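A sketch of how both Medium URL shapes could be normalized to a single author handle (the regex and the handling of a leading "@" are assumptions about the formats mentioned above):

```python
import re
from urllib.parse import urlparse

def medium_author(url: str) -> str | None:
    """Extract the author handle from <author>.medium.com/<article> or www.medium.com/<author>/... URLs."""
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path
    subdomain = re.match(r"^(?!www\.)([\w-]+)\.medium\.com$", host)
    if subdomain:
        return subdomain.group(1)
    if host in ("medium.com", "www.medium.com"):
        first = path.strip("/").split("/")[0]
        return first.lstrip("@") if first else None
    return None
```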
Currently, the Google Sheet at data.mev.fyi showing all data aggregates (articles, research papers, and so forth) isn't the nicest from a user-experience perspective: there is no user authentication, it does not remember filtering, and so forth. A nicer React frontend, e.g. a second page on mev.fyi, would drastically improve the experience of browsing the content database
Scrape Twitter threads in a readable format for the backend, e.g. save them as PDF
It would be great to explicitly indicate whether a tweet or thread is a reply to another tweet or thread, in which case it would be relevant to add the parent as the first post in the chronological order
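A sketch of the per-thread metadata that could capture that relationship (the field names are illustrative, not an existing schema):

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    """A scraped Twitter thread, with an optional pointer to the thread it replies to."""
    thread_url: str
    author: str
    tweets: list[str] = field(default_factory=list)  # tweets in chronological order
    in_reply_to: str | None = None                    # URL of the parent tweet/thread, if any

    def flatten(self, parent: "Thread | None" = None) -> list[str]:
        """Prepend the parent thread so the reply context comes first in chronological order."""
        prefix = parent.tweets if parent and self.in_reply_to else []
        return prefix + self.tweets
```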
The current article processing has copy-pasted methods, whereas we can simplify it down to a config JSON plus a few edge cases. Most of the time it is only about changing the CSS selector for each title, author, release date (plus correct formatting into yyyy-mm-dd), and content.
We can make something cleaner where each CSS selector and date format is a parameter.
Even better: automate the process with a scraping tool
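A sketch of that config-driven parser, assuming BeautifulSoup is used for the scraping; the selector values and config field names are illustrative, not the repo's current ones:

```python
from datetime import datetime

import requests
from bs4 import BeautifulSoup

SITE_CONFIG = {
    "title_selector": "h1.entry-title",
    "author_selector": "a.author-name",
    "date_selector": "time.published",
    "date_format": "%B %d, %Y",          # source format, converted to yyyy-mm-dd below
    "content_selector": "div.entry-content",
}

def parse_article(url: str, cfg: dict = SITE_CONFIG) -> dict:
    """Parse title, author, release date (yyyy-mm-dd), and content using per-site CSS selectors."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    pick = lambda sel: soup.select_one(sel).get_text(strip=True) if soup.select_one(sel) else ""
    raw_date = pick(cfg["date_selector"])
    date = datetime.strptime(raw_date, cfg["date_format"]).strftime("%Y-%m-%d") if raw_date else ""
    return {
        "title": pick(cfg["title_selector"]),
        "author": pick(cfg["author_selector"]),
        "release_date": date,
        "content": pick(cfg["content_selector"]),
    }
```

Switching sites then only means swapping the config dict, not duplicating the parsing code.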
So far we have only parsed content on a per-URL basis. An update to that would be to parse protocols' documentation by starting from the top-most documentation page, e.g. the docs landing page, getting the tree of all URLs in those docs, then scraping everything and adding it to the database.
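A sketch of that docs crawl, under the assumption that we only follow same-domain links starting from the landing page:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_docs(landing_url: str, max_pages: int = 500) -> set[str]:
    """Breadth-first crawl of a documentation site, returning every same-domain page URL found."""
    domain = urlparse(landing_url).netloc
    seen, queue = {landing_url}, deque([landing_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        for a in soup.find_all("a", href=True):
            url = urljoin(page, a["href"]).split("#")[0]  # resolve relative links, drop fragments
            if urlparse(url).netloc == domain and url not in seen:
                seen.add(url)
                queue.append(url)
    return seen
```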
We can implement https://github.com/context-labs/autodoc to automatically generate documentation from GitHub code repositories. That would be one step closer to making it easier to onboard developers to any protocol