A very easy way to extract text from an HTML file.
I used it as a first step in extracting the content of mutual fund summary prospectuses. For example, the page https://www.sec.gov/Archives/edgar/data/857489/000168386321001033/f8089d1.htm contains the February 2021 Summary Prospectus of the Vanguard FTSE All-World ex-US ETF. This fund is particularly interesting because it offers both ETF shares and traditional mutual fund shares, and allows conversion between the two types.
The EDGAR system requires you to register as a developer before you are authorized to scrape this webpage automatically. For the sake of instruction, I instead use the U.S. Department of Labor's webpage about "Green Jobs" as the example for extracting text from HTML.
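To make the idea concrete before turning to the real page, here is a minimal sketch of text extraction using only Python's standard library. It is not necessarily the method used in the rest of this write-up. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML string are all made up for illustration, and the sample is a stand-in rather than the actual content of the Department of Labor page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


# Hypothetical stand-in for a downloaded page.
sample_html = """
<html><head><title>Green Jobs</title><style>p {color: green;}</style></head>
<body><h1>Green Jobs</h1><p>Sample paragraph.</p><script>var x = 1;</script></body></html>
"""
print(html_to_text(sample_html))
```

Note that the page title is kept as ordinary text, so it appears once for `<title>` and once for `<h1>`. For a real page, the HTML string would come from a file read or an HTTP request instead of the inline sample.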