Persian Raw Text - متن خام فارسی

The package contains a huge amount of Persian text, collected from the following sources:

Each resource has been cleaned to exclude non-text content (URLs, HTML, non-UTF-8 content, etc.). I have also dropped lines that do not contain any Persian text. I have not done any deduplication, so there may be repeated content.
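Since the data is not deduplicated, you may want to filter it yourself. The following is a minimal sketch, not the script used to build this dataset: the function names are hypothetical, and "contains Persian text" is approximated as "contains at least one character in the Arabic Unicode block (U+0600 to U+06FF)", which also matches Arabic and Urdu. It drops exact repeats only; near-duplicates are left alone.

```python
import hashlib
import re
from typing import Iterable, Iterator

# Approximation (assumption): any character in the Arabic Unicode block.
# This covers the Persian alphabet but also matches Arabic, Urdu, etc.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")


def contains_persian(line: str) -> bool:
    """Return True if the line has at least one Arabic-script character."""
    return ARABIC_SCRIPT.search(line) is not None


def filter_and_dedupe(lines: Iterable[str]) -> Iterator[str]:
    """Yield stripped lines with Arabic-script text, skipping exact repeats."""
    seen = set()
    for line in lines:
        text = line.strip()
        if not text or not contains_persian(text):
            continue
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because `filter_and_dedupe` is a generator, it can be fed an open file handle and will process the text one line at a time. The set of digests still grows with the number of distinct lines, so memory use is not constant.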

The overall data is here (~70 GB, ~13.5 million paragraphs).

Note: since the files are relatively large, you probably shouldn't download them in your browser. A good way to download the files is to use gsutil (see here for more), which reports the total download size, download progress, etc.:

$ gsutil -m cp -R gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt .
Copying gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt...
/ [0/1 files][600.2 MiB/ 69.8 GiB]   0% Done 

You can also use tools like wget:

$ wget https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt 
--2020-05-17 14:53:08--  https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68720495550 (64G) [text/plain]
Saving to: ‘commoncrawl_fa_merged.txt.1’

commoncrawl_fa_merged.txt.1                    0%[                                                                                              ] 542.30M  55.9MB/s    eta 17m 44s
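Once a file is downloaded, it is too large to load into memory at once on most machines, so it is best read as a stream. Here is a minimal sketch, assuming one paragraph per line (the README does not state the exact layout) and using a hypothetical helper name:

```python
def count_paragraphs(path, encoding="utf-8"):
    """Count non-empty lines, reading the file lazily line by line."""
    count = 0
    # errors="replace" keeps the loop going if a stray invalid byte appears.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            if line.strip():
                count += 1
    return count
```

For example, `count_paragraphs("all_text_merged_cleaned.txt")` walks the whole file while holding only one line in memory at a time.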

Credits

If you find this repo useful, please include a reference to this repository.
