
distributed-wikipedia-mirror's Introduction

IPFS is an open system to manage data without a central server

Check out our website at ipfs.tech.

For papers on IPFS, please see the Academic Papers section of the IPFS Docs.

License

MIT.

distributed-wikipedia-mirror's People

Contributors

aschmahmann, fledgexu, flyingzumwalt, hsanjuan, ipfs-mgmt-read-write[bot], jbenet, kanej, kubuxu, lidel, mkg20001, momack2, punkchameleon, victorb, web-flow


distributed-wikipedia-mirror's Issues

[BOUNTY] Fix script responsible for preparing IPFS mirror

BOUNTY: $500 (how to claim?)

Summary

  • We unpack the ZIM and put it on IPFS as a regular HTML+JS+CSS website.
  • Before it is published, we customize the JS so that a footer is attached to every page, informing readers that this is an unofficial Wikipedia mirror.
    • The build inserts a custom footer into snapshots (#17, #15).
    • The existing scripts no longer work and need to be updated/redone before we start creating new snapshots (this is blocking #58, #61, #60)

TODO

Pick up where the PR at #67 left off and update the execute-changes.sh script to:

  • Ensure there are no JS errors when pages are loaded
  • Make it possible to navigate to other articles
    • ensure relative paths work
      • on https://ipfs.io/ipns/<cid>/wiki/
      • on https://<cid>.ipfs.dweb.link/wiki/
    • When unpacked with extract_zim, all pages are named ArticleName.html but they link to other article names without .html
      • idea: for every article, create ArticleName/index.html with a redirect page to ArticleName.html (similar to this)
  • Custom footer needs to be appended to every page
  • Update footer contents
    • add link to article snapshot at original Wikipedia
      • oldid= links can be found in page sources, for example:
        href="https://tr.wikipedia.org/w/index.php?title=<title>&amp;oldid=<timestamp>"
    • add link to the source .zim file
      • for now it can be link at download.kiwix.org, in the future it will be .zim on IPFS
    • remove logos/buttons of centralized services
    • include information on takedown policy / contact (eg. if latest snapshot includes information removed in upstream wikipedia)
  • Restore original Main Page
    • every Wikipedia has a Main Page under a different name
      • there are scripts to find out that name and fetch the original for the right snapshot, see work started in bb9f48c
      • what needs to happen is to download the original, fix it up to work locally and save it as /wiki/index.html
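The ArticleName/index.html redirect idea above can be sketched with a meta-refresh page. A minimal sketch with stand-in paths and a fabricated demo article, not the repo's actual script:

```shell
#!/bin/sh
# For every ArticleName.html under wiki/, create ArticleName/index.html
# holding a meta-refresh redirect to ../ArticleName.html, so that
# extension-less links like /wiki/ArticleName still resolve.
set -e
mkdir -p wiki
printf '<html></html>' > wiki/Ankara.html   # stand-in article for the demo
for page in wiki/*.html; do
  name=$(basename "$page" .html)
  mkdir -p "wiki/$name"
  cat > "wiki/$name/index.html" <<EOF
<!DOCTYPE html>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=../$name.html">
<link rel="canonical" href="../$name.html">
EOF
done
```

The canonical link is optional, but helps crawlers and caches treat the redirect stub and the real page as one document.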

Acceptance Criteria

  • PR with necessary changes is submitted and merged to this repo
  • The script works and enables us to produce an updated IPFS mirror of the latest Turkish snapshot: wikipedia_tr_all_maxi_2019-12.zim
    • CID of a demo output is provided

Add OS-native installers to ipfs station

IPFS Station is an Electron app that installs and runs IPFS for you. We want to make it easier to install Station by adding a package builder to IPFS Station that generates installers for Windows, Mac and Linux. This will make it possible for everyday users to run their own copy of IPFS and access it locally without using the command line.

More info and instructions in the ipfs station repository here: ipfs/ipfs-desktop#508

Insert custom footer into snapshots

Make a body.js file in this repo that inserts our footer instead of the original social media links. When we build wikipedia snapshots we will replace the original body.js with this new body.js so the footer includes a clear statement about who generated the snapshots. (see #13)

Add the new file to #16

Add English Snapshot

This is taking longer than the other languages because the English Wikipedia dump is more than 20x bigger than the others.

Add Snapshot of http://en.wikipedia.org

The IPNS entry for this snapshot will be (we can update this link to point to the hash of the most recent snapshot):

When the corresponding snapshot is ready, we will update that IPNS entry to point to the snapshot. We will also announce the link to that snapshot in comments on this issue and on the ipfs blog.

Modify snapshots to clearly declare that they were not created by wikipedia

If we publish unmodified snapshots of Wikipedia, it may confuse visitors, leading them to believe that the snapshots were published by Wikipedia or by volunteers who contribute to Wikipedia in Turkey. This could present a real risk for those volunteers, who have nothing to do with our snapshots, and may harm Wikipedia, which did not ask us to do this.

I propose modifying the snapshots to:

  1. Remove references to the Wikimedia user group in Turkey from the bottom of the page

  2. Adjust the top of the page to distinguish it from Wikipedia (by removing the puzzle globe logo and adding a prominent explanation that this is an independent project, not affiliated with Wikipedia)

This will force us to scrap all of the snapshots we've built and start over with modified code, which will delay the release of the snapshots by at least a few days.

Automate snapshot updates

This is a placeholder issue.
It will be updated with more details when we gain a better understanding of what is needed here.

In the long run, we want to introduce CI/CD automation that does something along these lines:

Then a maintainer would review the PR and merge it.
Updating the manifest in master would trigger an update of the DNSLink under <lang>.wikipedia-on-ipfs.org, propagating the change to the collaborative cluster etc.

spec: worker for building snapshots

design a worker that

  1. uses the script from #35 to
    • download data from a specified HTTP source
    • unpack the data and write it to IPFS
    • run a script (from this repo) that modifies the data (using IPFS files API)
    • pin the resulting data somewhere
  2. return the hash of the result
  3. (maybe) submit a PR to https://github.com/ipfs/distributed-wikipedia-mirror with the new hash

Optional:

Make the workers follow a queue.
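The numbered steps above could be wrapped in a single function along these lines. Command names and flags (extract_zim's --out, the execute-changes.sh argument) are assumptions; defining the function executes nothing:

```shell
#!/bin/sh
# Sketch of the worker pipeline; nothing runs until build_snapshot is called.
build_snapshot() {
  url=$1
  wget -O dump.zim "$url"           # 1. download data from the specified HTTP source
  extract_zim dump.zim --out out    #    unpack the data (hypothetical flag)
  cid=$(ipfs add -r -Q out)         #    write it to IPFS
  ./execute-changes.sh "/$cid"      #    modify the data (IPFS files API)
  ipfs pin add "$cid"               #    pin the resulting data somewhere
  echo "$cid"                       # 2. return the hash of the result
}
```

Step 3 (submitting a PR with the new hash) would wrap a call to this function and is left out here.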

Citation links broken

Assuming we are on page /ipfs/QmRoot/wiki/SomePage.html, clicking a citation link does not scroll the page to the #cite_note-xx anchor; instead it tries to load /ipfs/QmRoot/wiki/SomePage#cite_note-xx, which fails because SomePage (without .html) does not exist.

Related to #2
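One possible repair, assuming the snapshot keeps the .html file names, is to rewrite extension-less hrefs that carry an anchor. A sketch on a fabricated demo file; hrefs already containing a dot (e.g. ones that already end in .html) are deliberately left alone:

```shell
#!/bin/sh
# Rewrite href="SomePage#cite_note-xx" to href="SomePage.html#cite_note-xx"
# so the citation anchor points at a file that exists in the snapshot.
set -e
printf '<a href="SomePage#cite_note-3">[3]</a>' > demo.html
sed -i.bak -E 's|href="([^"/#.]+)#|href="\1.html#|g' demo.html
```

The character class excludes `.` so that links already carrying an extension are not turned into SomePage.html.html.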


Add Arabic and Kurdish snapshots

Add Snapshots of http://ar.wikipedia.org and http://ku.wikipedia.org

The IPNS entries for these snapshots will be (we can update these links to point to the hash of the most recent snapshot) :

When the corresponding snapshots are ready, we will update those IPNS entries to point to the snapshots. We will also announce the links to those snapshots in comments on this issue and on the ipfs blog.

Block internet search engines from indexing the mirror

If possible, could you make your mirror non-indexable by internet search engines? There is very little benefit for clearnet users in running across three different copies (WMF, WikiVisually and IPFS) of the same Wikipedia article every time they search for something.
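A gateway's robots.txt is outside the mirror's control, but a robots meta tag embedded in each page travels with the content to whichever gateway serves it. A sketch of injecting one during the build (the demo file is fabricated; the real pages live under wiki/ in the snapshot):

```shell
#!/bin/sh
# Inject a robots meta tag into each page's <head> so crawlers skip the mirror.
set -e
printf '<html><head><title>t</title></head><body></body></html>' > page.html
sed -i.bak 's|<head>|<head><meta name="robots" content="noindex">|' page.html
```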

Search

Currently there is no search available in the IPFS version.

The Turkish version has 521k titles including redirects (not sure right now how many without), which weigh a total of 13MiB (3MiB gzip-compressed). The challenge would be writing a fuzzy search over them in ways that:

  • don't bring the browser to its knees
  • save bandwidth (it is a one-time download if a local ipfs node is running, but downloading it right away for everyone is still a lot), so probably download the gzip version
  • are fast; uncompressing the gzip version on every search is wasteful, and the English wiki has 13M articles
  • support unicode

This might be a place to investigate a precalculated data structure that could be stored in IPFS to improve the search, both speed- and bandwidth-wise. If the data structure is sharded in IPFS, there should be a button to download the whole search index.
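As a baseline for comparison, the gzip-compressed title list can be stream-searched per query without ever storing it uncompressed; a fabricated three-title list stands in for the real ~3 MiB file. A real client-side version would need the precalculated, shardable index discussed above rather than a linear scan:

```shell
#!/bin/sh
# Baseline title search: keep the list compressed on disk and
# stream-decompress it for each query.
set -e
printf 'Ankara\nIstanbul\nIzmir\n' | gzip > titles.gz
query=tan
gzip -dc titles.gz | grep -i -- "$query"   # matches "Istanbul"
```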

Update tr.wikipedia-on-ipfs.org

  • create new (test) snapshot
    • couldn't extract fully, some files fail with other os error (dignifiedquire/zim#3)
    • fixed in extract_zim v0.2.0
  • ensure canonical link is correct (#48 (comment) + #65)
    • present in wikipedia_tr_all_maxi_2019-10.zim
  • ensure footer is updated (#64)
  • identify landing page parameter for execute-changes.sh
  • fix any broken JS by execute-changes.sh
  • recreate snapshot if needed
  • pin
    • set up collaborative pinning cluster (#68)
  • update DNSLink at tr.wikipedia-on-ipfs.org

This could be done manually or as a part of #58

Turkish Mirror Doesn't have Page about the referendum

The Turkish referendum was the inciting incident that led Erdoğan to block Wikipedia, because the article mentioned how he stole the election. It would be excellent if you could update that single page, if at all possible, before this goes live.

Make `execute-changes` script more generally usable

Make the script from #18, which is in https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/execute-changes.sh more generally usable. This is needed in order to fulfill #14 and in order to make it easier for other people to generate their own snapshots.

As a hacker who wants to add new wikipedia snapshots to IPFS in the language of my choice, I should be able to follow a clear set of instructions that allow me to download a zim dump, add it to IPFS, modify it with this script and then publish the resulting hash. The instructions should be clear, the configuration should be simple and it should be easy for me to set the correct IPNS hash and snapshot date based on the language version I'm adding.

Completion Requirements

  • the shell script works with minimal pre-configuration of your system
  • the shell script's documentation clearly declares what you need to do in order to run it
  • the shell script makes it easy or completely transparent to set the correct IPNS hash and snapshot date based on the current date and the language of the current snapshot
  • the readme at the root of this repo, or a page it links to, contains complete and accurate instructions for using this script when you're creating a snapshot
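For illustration, an invocation meeting these requirements might look like the following. The flag names mirror the getopt spec visible in the script's trace elsewhere in this issue list (--search, --ipns, --date, --main); the hash and path values are placeholders, and the sketch only writes the command out rather than running it:

```shell
#!/bin/sh
# Hypothetical fully-specified invocation of execute-changes.sh.
cat <<'EOF' > usage.txt
./execute-changes.sh \
  --ipns <ipns-hash-for-this-language> \
  --date 2019-12-01 \
  --main Anasayfa.html \
  /wikipedia-root
EOF
cat usage.txt
```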

Lots of IPFS errors with Turkish wikipedia mirror (loading fine nonetheless)

The IPFS mirror of the Turkish wikipedia (via IPNS) loads fine, but lots of errors are popping up:

22:01:30.949 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mameluke_Flag.svg.png: Failed to get block for zb2rhm7VzayQudATK8W2u37aGhPmEFknTou6zQXxCSHj6JYte: context canceled gateway_handler.go:548
22:01:30.966 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Flag_of_Greece_(1822-1978).svg.png: Failed to get block for QmP7hqsgd8NX2aCRUS3oGbgmM6ZSWaW21j7eXZ1ba7FVTv: context canceled gateway_handler.go:548
22:01:30.966 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coa_Croatia_Country_History_(Fojnica_Armorial).svg.png: Failed to get block for QmVfQXFGao22h7EvPuH66dakgj4FxkN98cuYNYFKr6bCoG: context canceled gateway_handler.go:548
22:01:30.967 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coat_of_arms_of_Hungary_(historic_design).png: Failed to get block for zb2rhX18u3QKYQQqeJtfE7kjHdNWtj9MhmZMNKNGoLLeT2Vmn: context canceled gateway_handler.go:548
22:01:30.969 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coa_Kastrioti_Family.svg.png: Failed to get block for QmVrXEXLDLgGtdA5avsnAsfEwEFWXv1gkqYutc338TP6W8: context canceled gateway_handler.go:548
22:01:30.975 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Flag_of_Turkey.svg.png: Failed to get block for QmNhRZWFoKfeuwbnpf35J9g1NzZCzSfHqPr1WiHqb8kSVf: context canceled gateway_handler.go:548
22:02:10.846 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Botanik_Parkı.jpg: Failed to get block for QmbemscoX5aXoNXRTTJ1adbysDTTZFMbt6YwZVynkn2dhL: context canceled gateway_handler.go:548
22:02:10.850 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Harikalar_Diyari_Gulliver_Lilliputians_06034_nevit.jpg: Failed to get block for QmPvPmfSc9GcDriwZA99AD3MHqKbpTVTKiuU2KVJBdTtdr: context canceled gateway_handler.go:548
22:02:10.856 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Çayyolu_Metro_19.JPG: Failed to get block for QmXxcZjuJHq2qnxGEedooW1WNdHtzL77UmwsxHSEBpoHHj: context canceled gateway_handler.go:548
22:02:10.857 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Ankara_Central_Station_2012_03.JPG: Failed to get block for QmRfdvNsndpiNEbNTuH1ZaDfrrMsa1WNDKMNoT7HBcXkzY: context canceled gateway_handler.go:548
22:02:10.880 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Ankara_road_map.png: Failed to get block for zb2rhgcucL97ksC69WTWyfp7SF7FQPkZi6T6gz79qxNaGM2Ca: context canceled gateway_handler.go:548
22:02:10.884 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Dirus-Roads_of_Ankara.svg.png: Failed to get block for zb2rhiF3XBBBvTDhYAhw7qgcEB1FRtCrATUgZmEJ38bxQ9tjv: context canceled gateway_handler.go:548
22:02:17.185 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Desc-20.png: Failed to get block for QmXu8c6yjTpxNgiqmBnEjXCGRNVCNXpFzy6QdBPdef7q8c: context canceled gateway_handler.go:548
22:02:17.185 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/BlankMapTurkeyProvinces.png: Failed to get block for QmSDKbXfco38V8CR7niButc81cogiW9m1QaXHaxpocBTvD: context canceled gateway_handler.go:548
22:02:17.218 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mausoleo_Mevlana.jpg: Failed to get block for zb2rhmWtXQ19xRDHSnWtHCzLk6tfVmq3wFFkf5FbUWsksFrta: context canceled gateway_handler.go:548
22:02:17.222 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Dolmabahçe_Palace_(cropped).JPG: Failed to get block for QmUXuCuWnqYE82sCYNQnn35Kk4yHgkHSgeNdcgymdWc3xj: context canceled gateway_handler.go:548
22:02:17.267 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mustafa_Kemal_Atatürk_.jpg: Failed to get block for Qmbw36E5GkUSfnZ2jida2ufKNSEejpXwCH1eUDhgZBdB23: context canceled gateway_handler.go:548
22:02:17.323 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Recep_Tayyip_Erdogan.PNG: Failed to get block for QmaAUgvy4fEK76RdKkEVfYErR3LCF9iFbBqP4tui3kmRbE: context canceled gateway_handler.go:548
22:02:23.634 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Erdoğan_(2014_cumhurbaşkanlığı_seçim_logosu).jpg: Failed to get block for QmXqbHEuXywNnxQyygjZZpxVaaf2tWffrG7gUz6Tt4hTjS: context canceled gateway_handler.go:548
22:02:23.637 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Cari_islemler_dengesi.png: Failed to get block for QmRx4esjoYLmBV7nC4i1ZrqFSMAoqJ3bRgS5dK7fJ8LvdX: context canceled gateway_handler.go:548
22:02:23.637 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Family_photo_of_Council_of_ministers_of_Turkey_and_Spain.jpg: Failed to get block for QmQAe9RCdiuHunR6mmuMjYA2X8LGiWrYybPPLYgUy6Z57b: context canceled gateway_handler.go:548
22:02:23.704 ERROR commands/h: err: write tcp4 127.0.0.1:8080->127.0.0.1:57281: write: protocol wrong type for socket handler.go:288
22:04:34.318 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/c326d317eddef3ad3e6625e018a708e290a039f6.svg: Failed to get block for zb2rhXQVxo883MW1cS2mYa3wYSv2ZLiQTmbn2EFrUXa1DSZUi: context canceled gateway_handler.go:548
22:04:34.319 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/c5d0431ce231935522dc0cb52df7f2b406cdadc3.svg: Failed to get block for zb2rhfsuyFnUf9j4BJx27sNzNB31mYSkdDQHgEWCLB5vxaDq8: context canceled gateway_handler.go:548
22:04:34.320 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/e1d67495288eac0fa90d5bbcad7d9a343c15ad56.svg: Failed to get block for zb2rhkXbtbZ9HpyHJixHRs6ioKjktd1ZJCiGfTNAK29BS8hjb: context canceled gateway_handler.go:548
22:04:34.320 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/504dc030b18a6fcb9575b9c70b2d9314e86ece5e.svg: Failed to get block for QmemtQK1z99jDhqsHgnwihX3PjiWJpjHRA6PfXN6oCFeQV: context canceled gateway_handler.go:548
22:04:34.321 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/Yüzey_sınıflandırması.jpg: Failed to get block for zb2rhnA149yrwMBERK9q35ZosesLekqEDn5ExhRjZYZi1xUn6: context canceled gateway_handler.go:548
22:04:34.322 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/21de672b1953817ed423e8f4c008498a81341292.svg: Failed to get block for zb2rhaDTRHNvVM6PW88ih4J7euvotR34SozzZ1dvDdRbMe6sj: context canceled gateway_handler.go:548

Any updates? News? etc?

Is this still actively maintained? Is there any roadmap? When will the mirror be read-write?

Turkish IPFS mirror doesn't work with localhost redirect: core/serve error

When I deactivate my localhost redirect script, the Turkish WP mirror loads fine (both the ipfs snapshot and the ipns current version) via the ipfs.io gateway. After enabling the userscript, the redirect to localhost works, i.e. the address has the correct hash etc., but the browser & console print error messages:

00:35:07.573 ERROR core/serve: ipfs cat /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/wiki/Anasayfa.html: no link named "wiki" under QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX gateway_handler.go:525

00:37:51.368 ERROR core/serve: ipfs cat /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/wiki/Anasayfa.html: no link named "wiki" under QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX gateway_handler.go:525

Anyone know what's producing the error? Other ipfs/ipns address redirects seem to work fine.

EDIT: direct input of localhost addresses produces the same errors.

Write a script that modifies ZIM dumps to include info about how they were generated

  1. Replace out/-/j/body.js with the /scripts/body.js from this repo
  2. In that copy of body.js, replace these placeholders with the relevant values
    • {{SNAPSHOT_DATE}} -- date that this snapshot is being generated
    • {{IPNS_HASH}} - IPNS hash for this version of wikipedia (probably corresponds to the language code. ie. tr.wikipedia.org probably has its own ipns hash)
  3. Copy /assets/wikipedia-on-ipfs.png into the root of the snapshot
  4. Copy /assets/wikipedia-on-ipfs-small-flat-cropped-offset-min.png to /out/I/s/Wikipedia-logo-v2-200px-transparent.png in the snapshot

When you're done writing this script, please

  • add the script to this repo
  • Update the instructions in the repo's README.md to include the new step of running this script.
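Steps 1–2 can be sketched as a placeholder substitution over a scratch copy of body.js. The paths and the IPNS hash below are demo values, not the real repo layout:

```shell
#!/bin/sh
# Substitute the snapshot date and IPNS hash placeholders in body.js,
# writing the result where the ZIM dump's original body.js lived.
set -e
mkdir -p out/-/j scripts
printf 'var snap="{{SNAPSHOT_DATE}}", ipns="{{IPNS_HASH}}";' > scripts/body.js
SNAP_DATE=$(date +%Y-%m-%d)
IPNS_HASH=QmExampleIpnsHashForThisLanguage   # placeholder, per-language in reality
sed -e "s/{{SNAPSHOT_DATE}}/$SNAP_DATE/g" \
    -e "s/{{IPNS_HASH}}/$IPNS_HASH/g" \
    scripts/body.js > out/-/j/body.js
```

Steps 3–4 are plain file copies of the two logo assets into the snapshot and are omitted here.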

Add Arabic snapshot

blocked by #26, replaces #14

Add Snapshot of http://ar.wikipedia.org

The IPNS entry for this snapshot will be (we can update this link to point to the hash of the most recent snapshot):

When the corresponding snapshot is ready, we will update that IPNS entry to point to the snapshot. We will also announce the link to that snapshot in comments on this issue and on the ipfs blog.

Gather background info from other repositories and add to this one

Background info

The idea of putting Wikipedia on IPFS has been around for a while. Every few months or so someone revives the threads. You can find such discussions in this github issue about archiving wikipedia, this issue about possible integrations with Wikipedia, and this proposal for a new project.

what's missing?

There's an even bigger, longer-running conversation about decentralizing wikipedia. See https://strategy.m.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

Current Work

Rendering pages from XML sources instead of kiwix HTML dumps

ZIM files are a nice source, but they are a bit limiting: the layout is fixed (and it doesn't look like Wikipedia), and they are updated much more rarely than the XML dumps.

Rendering the XML ourselves would be quite an effort, so I think it is a long-term goal.

ZIM files are also a good source of MediaWiki assets (as no per-language dumps of assets are available).

Provide recommendations for using InterPlanetary Test Lab to generate wikipedia snapshots

@FrankPetrilli in order to follow the data control plan described here we need a way to spin up workers based on a configuration (or docker file) and then use those workers to

  • download data from a specified HTTP source
  • unpack the data and write it to IPFS
  • run a script (from this repo) that modifies the data (using IPFS files API)
  • pin the resulting data somewhere
  • return the hash of the result

This overlaps with the kind of stuff that we do with InterPlanetary Test Lab. How hard would it be to set this up?

Add all the other wikipedia snapshots

Many countries have blocked Wikipedia over the years, and Wikipedia has also suffered from DDOS attacks causing service drops in various regions. We should broaden our current list of Wikipedia mirrors to include other languages. I propose doing the languages with the most users - as of now there are 13 languages with more than 1M users and 6 languages with more than 2.5M users.

Language    Wiki      Users  Active Users  Good Pages  Total Pages  On IPFS
English     en     37103876        124823     5926367     48505470
Spanish     es      5543043         15373     1543777      6782414
French      fr      3544188         16349     2138049     10319452
German      de      3271376         18550     2341066      6547819
Chinese     zh      2803562          8437     1073112      5907531
Russian     ru      2589568         10331     1567312      5987865
Portuguese  pt      2297005          5696     1013328      4890332
Italian     it      1867701          8066     1551915      6360102
Arabic      ar      1709753          4342      949434      5651012
Japanese    ja      1527805         14061     1167744      3460642
Indonesian  id      1087372          2786      502154      2604650
Turkish     tr      1042570           683      333283      1666749
Dutch       nl      1018242          3866     1978041      4116417

For each:

  • create a new snapshot of https://<lang>.wikipedia.org/wiki/
  • ensure canonical link is correct
    • this is blocked on upstream snapshots or updating our scripts: #48 (comment) + #65
  • pin
  • create and update DNSLink at <lang>.wikipedia-on-ipfs.org

links should be exactly the same as wikipedia


  • pages have “.html” at the end, can we get no “.html”? (with directories + index.html or whatever)
  • i want the links to be the same so people can just change a prefix somewhere and have everything else just work
  • are media links the same as they are on wikipedia.org?

wrapper script for building snapshots

As a person who wants to build snapshots, I want to build a snapshot for a specific language by

  • cloning this repo
  • running a single script with a simple command (just providing the language code)

Completion state:

This repository provides a script that takes one required argument: the language code of the snapshot you want to build.
Based on that argument and the info in this repo, it

  • pulls from kiwix
  • unpacks the dump
  • modifies the dump
  • writes the result to ipfs
  • reports the resulting hash
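A skeleton of such a wrapper might look like this. The kiwix URL layout and the helper commands are assumptions, and the network/IPFS steps are left commented out so the sketch stays self-contained:

```shell
#!/bin/sh
# Wrapper sketch: one argument, the language code of the snapshot to build.
set -e
LANG_CODE=${1:-tr}                 # the single required argument (defaulted for the demo)
SNAP=$(date +%Y-%m)
ZIM="wikipedia_${LANG_CODE}_all_maxi_${SNAP}.zim"
URL="https://download.kiwix.org/zim/wikipedia/$ZIM"   # assumed path layout
echo "would fetch: $URL" > plan.txt
# wget "$URL"                      # pull from kiwix
# extract_zim "$ZIM" --out out     # unpack the dump (hypothetical flag)
# ./execute-changes.sh ...         # modify the dump
# ipfs add -r -Q out               # write the result to ipfs, report the hash
cat plan.txt
```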

Handle the logo override in a cleaner way

What's in the Snapshot

The style.js sets the logo with this css style:

.globegris{background-image:url(../../I/s/Wikipedia-logo-v2-200px-transparent.png) }

the logo is offset by a background-position style in the HTML of the page:

<td class="globegris" style="background-repeat:no-repeat; background-position:-40px -15px; width:100%; border:1px solid #a7d7f9; vertical-align:top;">

Current Hack

  1. I made a version of the wikipedia-on-ipfs logo that corrects for the offset: https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/assets/wikipedia-on-ipfs-offset.png
  2. The script from #18, which needs to be run on ZIM dumps before adding them to IPFS, will copy that offset logo to /out/I/s/Wikipedia-logo-v2-200px-transparent.png in the snapshot

Alternatively, we can modify the style.css, but we still need to deal with the offset unless we also modify the style attribute in the HTML.
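The style.css alternative could be sketched as a URL swap. The file contents below are a minimal stand-in, and the background-position offset in the HTML would still need separate handling:

```shell
#!/bin/sh
# Point .globegris at the replacement logo instead of shipping a
# pre-offset image.
set -e
printf '.globegris{background-image:url(../../I/s/Wikipedia-logo-v2-200px-transparent.png) }' > style.css
sed -i.bak \
  's|url(../../I/s/Wikipedia-logo-v2-200px-transparent.png)|url(../../I/s/wikipedia-on-ipfs.png)|' \
  style.css
```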

What needs to happen

  • confirm which approach we should use. If we want to use a different approach, update the assets and #18 accordingly
  • (if possible) make a version of the offset logo with better anti-aliasing on the resized text.

"Error: file does not exist" when trying to add CA wikipedia

Got a dump whose hash is QmXq9FMaTYKU6sY91XZyZvuFsee165FuGCyHvWQrQrwk33

Trying to run the final step, ./execute-changes.sh. Running it without arguments prompts me for <ipfs files root>, so I gave it the location of the root file I copied in the previous step. I end up with the command ./execute-changes.sh /root, but that gives me the Error: file does not exist error.

This is the full execution:

+ IFS='
        '
++ getopt --test
++ echo 4
+ '[' 4 -ne 4 ']'
+ LONG_OPT=help,search:,ipns:,date:,main:
+ SHORT_OPT=h
++ getopt -n ./execute-changes.sh -o h -l help,search:,ipns:,date:,main: -- /root
+ PARSED_OPTS=' -- '\''/root'\'''
+ eval set -- ' -- '\''/root'\'''
++ set -- -- /root
++ date +%Y-%m-%d
+ SNAP_DATE=2017-10-08
+ IPNS_HASH=
+ SEARCH=
+ MAIN=index.htm
+ true
+ case "$1" in
+ shift
+ break
+ '[' -z /root ']'
+ ROOT=/root
+ ipfs files stat /root/A
++ sed -e 's/{{SNAPSHOT_DATE}}/2017-10-08/g' -e 's/{{IPNS_HASH}}//g' scripts/body.js
++ ipfs add -Q
++ '[' -n '' ']'
++ cat -
+ NEW_BODYJS=QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F
+ ipfs-replace -/j/body.js /ipfs/QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F
+ ipfs files rm /root/-/j/body.js
+ true
+ ipfs files --flush=false cp /ipfs/QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F /root/-/j/body.js
Error: file does not exist
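A guess from the trace: the script's own stat of /root/A succeeds, but the final `ipfs files cp` back into /root/-/j fails, which suggests the MFS path passed as <ipfs files root> does not contain the snapshot's "-" directory. Two checks that would narrow this down; they need a running ipfs daemon, so the sketch only writes them to a checklist:

```shell
#!/bin/sh
# Diagnostic checklist (not executed here: requires an ipfs daemon).
cat <<'EOF' > diagnose.txt
ipfs files ls /root        # expect the snapshot dirs, e.g. A, I, M, -
ipfs files stat /root/-/j  # must exist before body.js can be copied back
EOF
cat diagnose.txt
```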
