Wikipedia Categories Graph

This repository contains the code to create the graph of the categories in Wikipedia.

Introduction

Wikipedia has around 1.7 million categories, connected to one another in a graph structure. As part of a project to find the most common words in the Main Topic Classification categories, we needed to map each article to one of these categories. To do so, we built the Categories Graph and used it to compute shortest paths, mapping each article to the closest macro-category. More on this project here.
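To make the "closest macro-category" idea concrete, here is a minimal in-memory sketch of a multi-source breadth-first search. It is only an illustration: the project itself runs its queries against Neo4j, and the actual mapping algorithm is described in the project's paper and may differ. The class name, the child-to-parent edge direction, and the toy edges in main are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper, not part of this repository.
final class ClosestMacroCategory {

    // Breadth-first search started from all of the article's categories at once.
    // The first macro-category dequeued is one at minimal hop distance.
    // `parents` maps a category to its parent categories (assumed edge direction).
    static Optional<String> closest(Map<String, List<String>> parents,
                                    List<String> startCategories,
                                    Set<String> macroCategories) {
        Queue<String> frontier = new ArrayDeque<>(startCategories);
        Set<String> visited = new HashSet<>(startCategories);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (macroCategories.contains(current)) {
                return Optional.of(current);
            }
            for (String parent : parents.getOrDefault(current, List.of())) {
                if (visited.add(parent)) { // add() is false if already visited
                    frontier.add(parent);
                }
            }
        }
        return Optional.empty(); // no macro-category is reachable
    }

    public static void main(String[] args) {
        // Toy edges invented for illustration; they are not taken from the dumps.
        Map<String, List<String>> parents = Map.of(
            "Database_management_systems", List.of("Databases"),
            "Databases", List.of("Computing"),
            "Computing", List.of("Technology"));
        System.out.println(closest(parents,
            List.of("Database_management_systems"),
            Set.of("Arts", "Technology")));
    }
}
```

When several macro-categories are at the same distance, this sketch returns whichever is dequeued first, so it does not define a tie-breaking rule.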

This repository contains only the code to create the graph, with some minor fixes so that it builds the complete graph rather than the partial one built in the linked project.

Installation

Make sure Maven is installed on the host.

The Java program root folder is this one.

Open a terminal in this folder and run the following command:

mvn package

Install Neo4j server and make sure it is running the bolt protocol on port 7687.
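Before starting the long-running import, it can help to confirm that something is listening on the Bolt port. The sketch below is a plain TCP reachability check using only the Java standard library. It does not speak the Bolt protocol, so a positive result only means the port accepts connections. The class name, the host localhost, and the 2-second timeout are assumptions for this example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of this repository.
final class BoltPortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    public static void main(String[] args) {
        // 7687 is the Bolt port named above; "localhost" is an assumption.
        System.out.println("Bolt port reachable: " + isReachable("localhost", 7687, 2000));
    }
}
```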

Usage

Creating the category database

Download the SQL dumps of these Wikipedia tables from the official page: category, categorylinks, page.

Change directory into the target directory (./wikipedia-graph-neo4j/target) and run:

java -jar .\wikipedia-graph-neo4j-0.0.1-SNAPSHOT.jar 
--spring.profiles.active=create-wiki-graph-db 
--category-dump-file=<path to category file> 
--page-dump-file=<path to page file>
--category-links-dump-file=<path to categorylinks file>
--base-folder=<the folder where the program outputs files>

Make sure to replace each <> placeholder with the corresponding path: the three downloaded dump files and the output folder. This process takes several hours.

Exposing the HTTP interface

Simply run:

java -jar .\wikipedia-graph-neo4j-0.0.1-SNAPSHOT.jar

By default it listens on localhost:8080 and exposes some APIs.

Example of an article mapping HTTP request:

http://localhost:8080/mapCategory?startCategories=Database_management_systems::Databases&endCategories=Arts::Geography::Technology::Science::People::World

The startCategories query parameter lists the categories the article belongs to, and endCategories lists the candidate macro-categories. In both, category names are separated by ::. One of the macro-categories is returned as the result of the mapping. More on this mapping algorithm can be read in the final paper of the project.
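The request above can also be built and sent from Java. The following is a minimal sketch using the JDK's java.net.http client (Java 11 or later). The endpoint path, parameter names, and :: separator come from the example URL. The class and method names are hypothetical, and because the response format is not documented here, the body is returned as a raw string.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical client, not part of this repository.
final class MapCategoryClient {

    // URL-encodes each category name, then joins the names with "::".
    private static String join(List<String> categories) {
        return categories.stream()
            .map(c -> URLEncoder.encode(c, StandardCharsets.UTF_8))
            .collect(Collectors.joining("::"));
    }

    // Builds the /mapCategory URI in the shape of the example request.
    static URI mapCategoryUri(String baseUrl, List<String> start, List<String> end) {
        return URI.create(baseUrl + "/mapCategory"
            + "?startCategories=" + join(start)
            + "&endCategories=" + join(end));
    }

    // Sends a GET request and returns the raw response body.
    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        URI uri = mapCategoryUri("http://localhost:8080",
            List.of("Database_management_systems", "Databases"),
            List.of("Arts", "Geography", "Technology", "Science", "People", "World"));
        System.out.println(uri);
        // Call fetch(uri) once the server is running to get the mapped macro-category.
    }
}
```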

Example of a shortest path HTTP request:

http://localhost:8080/shortestPath?startCategory=Database_management_systems&endCategory=Arts&maxPathLength=10
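A shortest path request can be prepared in the same way. This sketch only builds the HttpRequest; the parameter names are taken from the example URL, while the class name and the 30-second timeout are assumptions. Category names are assumed to be in Wikipedia's underscore form and are URL-encoded before being placed in the query string.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical helper, not part of this repository.
final class ShortestPathRequest {

    // Builds a GET request for /shortestPath in the shape of the example URL.
    static HttpRequest build(String baseUrl, String startCategory,
                             String endCategory, int maxPathLength) {
        URI uri = URI.create(baseUrl + "/shortestPath"
            + "?startCategory=" + URLEncoder.encode(startCategory, StandardCharsets.UTF_8)
            + "&endCategory=" + URLEncoder.encode(endCategory, StandardCharsets.UTF_8)
            + "&maxPathLength=" + maxPathLength);
        return HttpRequest.newBuilder(uri)
            .timeout(Duration.ofSeconds(30)) // assumed value; adjust to your graph size
            .GET()
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8080",
            "Database_management_systems", "Arts", 10);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request can then be passed to HttpClient.send with HttpResponse.BodyHandlers.ofString() once the server is running.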

You may have implemented other APIs to query the graph. If you used the Spring @Controller annotation, these will be available at the same host and port.

License

The package is Open Source Software released under the MIT license.

Contributors

incioman
