Topic: common-crawl
Something interesting about common-crawl
common-crawl,Distributed download scripts for Common Crawl data
User: alumik
common-crawl,Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖
User: ashvardanian
Home Page: https://ashvardanian.com/posts/stringzilla/
common-crawl,German small and large versions of GPT2.
User: bminixhofer
common-crawl,This library is a very lightweight client to Common Crawl's WARC files.
Organization: bottomless-archive-project
common-crawl,An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
Organization: bottomless-archive-project
common-crawl,Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Organization: code402
Home Page: https://code402.com/hello-warc-common-crawl-code-samples
common-crawl,Statistics of Common Crawl monthly archives mined from URL index files
Organization: commoncrawl
Home Page: https://commoncrawl.github.io/cc-crawl-statistics/
common-crawl,Various Jupyter notebooks about Common Crawl data
Organization: commoncrawl
common-crawl,Process Common Crawl data with Python and Spark
Organization: commoncrawl
common-crawl,Tools to construct and process webgraphs from Common Crawl data
Organization: commoncrawl
common-crawl,News crawling with StormCrawler - stores content as WARC
Organization: commoncrawl
common-crawl,This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau.
User: connor-marchand
common-crawl,Drill into WARC web archives
Organization: crissyfield
common-crawl,Discourse Markers identification in French Language
User: dahouabdelhalim
common-crawl,This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.
User: erikgartner
common-crawl,Hadoop streaming EMR job
User: ggodreau
common-crawl,Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
User: hadrianw
common-crawl,Parsing Huge Web Archive files from Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
User: hrn-projects
common-crawl,A dataset for knowledge base population research using Common Crawl and DBpedia.
Organization: ibm
common-crawl,We explore data using big data analysis and visualization. The work has three main steps: (i) data aggregation from different sources, (ii) big data analysis using MapReduce, and (iii) visualization in Tableau. Data analysis is critical to understanding the data and what can be done with it. Small datasets are easy to process, but large companies need to identify trends across much more data to decide what changes to make, which is the problem big data analysis addresses. In this lab, we collect close to 20,000 tweets, 500 New York Times articles, and 500 articles from Common Crawl data about entertainment, our main topic. We preprocess this data and feed it to a MapReduce job to compute word counts and word co-occurrences, then use the results to find trends in the collected data. The analysis is done in Python.
User: mgosi
common-crawl,A Python utility for downloading Common Crawl data
User: michaelharms
Home Page: https://github.com/michaelharms/comcrawl#readme
common-crawl,A Common Crawl client example for scraping specific websites.
User: neil-zt
common-crawl,An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Organization: oscar-project
Home Page: https://oscar-corpus.com
common-crawl,The website of the Oscar Project
Organization: oscar-project
Home Page: https://oscar-project.org
common-crawl,:spider: The pipeline for the OSCAR corpus
Organization: oscar-project
Home Page: https://oscar-corpus.com
common-crawl,Parsing the Common Crawl database using Scala and Spark
User: skyler-myers-db
common-crawl,Perform big data analysis on New York Times, Twitter and Common Crawl APIs
User: socket-var
Home Page: https://cse587-viz-kxodrdissk.now.sh/
common-crawl,Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)
User: srmocher
common-crawl,Common Crawl's processing tools
Organization: toimik
common-crawl,Various Common Crawl utilities in Clojure.
Organization: tokenmill
common-crawl,ES6 class to read a .warc or .warc.gz file member by member in Node.js
User: vikasg7