
sde's Introduction

Structured Data Extractor (SDE) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu of the University of Illinois at Chicago and published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, SDE detects the data records it contains and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor.

Usage

  1. Extract sde.zip.
  2. Make sure that the Java Runtime Environment (version 5 or higher) is installed on your computer.
  3. Open a command prompt (Windows) or shell (UNIX).
  4. Go to the directory where you extracted sde.zip.
  5. Run this command: java -jar sde-runnable.jar URI_input path_to_output_file
  6. The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must begin with "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html", or in a UNIX environment: "file:///home/seagate/input/input.html".
  7. The path_to_output_file parameter must be a valid path on the host operating system, such as "D:\Data\output.html" (Windows) or "/home/seagate/output/output.html" (UNIX).
  8. The extracted data can be viewed in the output file. The output file is an HTML document in which the extracted data is presented as HTML tables.
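
The "file:///" form required in step 6 is easy to get wrong by hand, so here is a minimal sketch of building it from an ordinary local path and assembling the command from step 5. It is not part of SDE: the class name SdeCommand and its two helper methods are hypothetical, the default paths are the examples from the steps above, and the sketch assumes Java 8 or later (newer than the minimum SDE itself needs) because it uses java.nio.file and String.join.

```java
import java.net.URI;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with SDE.
class SdeCommand {

    // Converts a local path to an absolute file URI,
    // e.g. one starting with "file:///home/..." on UNIX.
    static URI toFileUri(String localPath) {
        return Paths.get(localPath).toAbsolutePath().toUri();
    }

    // Assembles the arguments of the command shown in step 5.
    static List<String> buildCommand(String inputUri, String outputPath) {
        return Arrays.asList("java", "-jar", "sde-runnable.jar", inputUri, outputPath);
    }

    public static void main(String[] args) {
        // Fall back to the example paths from the usage steps.
        String input = args.length > 0 ? args[0] : "/home/seagate/input/input.html";
        String output = args.length > 1 ? args[1] : "/home/seagate/output/output.html";

        List<String> command = buildCommand(toFileUri(input).toString(), output);

        // Print the command so it can be pasted into the shell.
        System.out.println(String.join(" ", command));
    }
}
```

The sketch only prints the command. To launch SDE directly, the same list could be passed to java.lang.ProcessBuilder from the directory that contains sde-runnable.jar.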

Source Code

SDE source code is available at GitHub.

Dependencies

SDE was developed using these libraries:

  • Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.
  • Xerces by The Apache Software Foundation. Licensed under Apache License Version 2.0.

License

SDE is licensed under the MIT license.

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.

sde's People

Contributors

seagatesoft

sde's Issues

The download link is broken

Hi,

Thanks for your efforts in building this repo. I would like to try this tool in my project, but the download link is no longer valid. Do you still have the executable binaries available, and could you fix the link? Thanks!

maven build files

Hi Sigit,

I have just tried your SDE tool after having it on my TODO list, and I am extremely impressed!
Do you still have any Maven/Ant build files for the project, or do you recall how you might have compiled it? ;)

Best Regards,
Johannes Ahlmann
