CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

📜 This is the official code repository for the pre-print 'CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl', published on arXiv: https://arxiv.org/abs/2405.11039

Authors: Ilya Ilyankou, Meihui Wang, Dr James Haworth and Dr Stefano Cavazzi

Abstract

The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008. The dataset is instrumental in training large language models, and as such it has been studied for (un)desirable content, and distilled for smaller, domain-specific datasets. However, to our knowledge, no research has been dedicated to using CC as a source of annotated geospatial data. In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC, and the resulting multimodal dataset with 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases. The dataset can be used to study people's outdoor activity patterns, the way people talk about their outdoor experiences, and for developing trajectory generation or track annotation models.

Figure: Example routes with descriptions from the paper

Setup

We recommend running the notebooks in a separate virtual environment. Using conda:

# Navigate to the project folder
cd cc-gpx

# Create a new virtual environment
conda env create -f environment.yml

# Activate that new virtual environment
conda activate cc-gpx

# Run Jupyter (will open in your default browser)
jupyter lab

Dataset

Run the notebooks in order to build the final GeoPackage dataset with the following fields:

| # | Property | Description |
|---|----------|-------------|
| 1 | url | URL of the GPX file |
| 2 | warc_file | CC WARC file with GPX file |
| 3 | warc_offset | GPX file position in WARC |
| 4 | warc_len | GPX file byte length |
| 5 | country | Country name as determined by the first point in the track intersecting geoBoundaries |
| 6 | desc | Original track description |
| 7 | desc_lang | Track description language code, as determined by pycld2 |
| 8 | desc_en | Track description translated into English |
| 9 | elev_source | GPS if elevation is recorded by device; DEM if determined later from Shuttle Radar Topography Mission |
| 10 | elev_highest | Track’s highest point, m |
| 11 | elev_lowest | Track’s lowest point, m |
| 12 | uphill | Cumulative elevation gain, m |
| 13 | downhill | Cumulative elevation loss, m |
| 14 | length_2d | Track length disregarding elevation, m |
| 15 | length_3d | Track length accounting for elevation, m |
| 16 | is_circular | True if start and end points are within 350 m from each other, False otherwise |
| 17 | geometry | MultiLineString Z geometry in GPS coordinates: (lat, lon, elevation) |
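
To make the track-level fields more concrete, here is a minimal sketch of how values such as length_2d, length_3d, uphill, downhill and is_circular could be derived from a list of (lat, lon, elevation) points. It is an illustration only and not necessarily how the notebooks compute them: the function names `haversine_m` and `track_metrics`, the spherical-Earth radius, the absence of any elevation smoothing, and treating "within 350 m" as an inclusive threshold are all assumptions of this example.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius (assumption: spherical Earth)
CIRCULAR_THRESHOLD_M = 350   # start/end distance threshold from the field table


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def track_metrics(points):
    """Summarise a track given as a list of (lat, lon, elevation_m) tuples.

    Hypothetical helper: raw elevation differences are summed without
    smoothing, so noisy GPS elevations would inflate uphill/downhill.
    """
    length_2d = length_3d = uphill = downhill = 0.0
    for (lat1, lon1, ele1), (lat2, lon2, ele2) in zip(points, points[1:]):
        flat = haversine_m(lat1, lon1, lat2, lon2)
        climb = ele2 - ele1
        length_2d += flat                      # horizontal distance only
        length_3d += math.hypot(flat, climb)   # include the vertical component
        if climb > 0:
            uphill += climb
        else:
            downhill -= climb
    (start_lat, start_lon, _), (end_lat, end_lon, _) = points[0], points[-1]
    gap = haversine_m(start_lat, start_lon, end_lat, end_lon)
    elevations = [ele for _, _, ele in points]
    return {
        "elev_highest": max(elevations),
        "elev_lowest": min(elevations),
        "uphill": uphill,
        "downhill": downhill,
        "length_2d": length_2d,
        "length_3d": length_3d,
        "is_circular": gap <= CIRCULAR_THRESHOLD_M,
    }


# Made-up sample track that returns close to its starting point
sample = [
    (51.5000, -0.1200, 10.0),
    (51.5010, -0.1200, 25.0),
    (51.5010, -0.1180, 15.0),
    (51.5001, -0.1199, 12.0),
]
metrics = track_metrics(sample)
```

Because each 3D segment is the hypotenuse of the horizontal distance and the elevation change, length_3d in this sketch can never be smaller than length_2d.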

Cite

If you find this dataset or workflow useful for your research, please cite us!

@article{ilyankou2024ccgpx,
      title={CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl}, 
      author={Ilyankou, Ilya and Wang, Meihui and Haworth, James and Cavazzi, Stefano},
      year={2024},
      journal={arXiv preprint arXiv:2405.11039},
}
