htmlmetadata's Introduction

About

htmlmetadata is a CLI app that extracts metadata from HTML. It is extremely fast (written in Nim), but it might not handle every edge case.

I use this tool often, so you can be sure that it's maintained (i.e., working), even if it has not had recent activity.

Installation

Install Nim, which includes its own package manager, nimble:

brew install nim
# or
sudo apt install nim
# or ...

Then install htmlmetadata:

nimble install https://github.com/NightMachinary/htmlmetadata

Don't forget to add nimble's binary path (~/.nimble/bin/ on my machine) to your PATH.

Usage

htmlmetadata reads the HTML input from stdin, so pipe the page into it, for example: curl http://example.com | htmlmetadata ...

  htmlmetadata
    Will print all the extracted metadata in a human-readable format.
  htmlmetadata <name-of-metadata> ...
    Will print the requested metadata only, separated by the NUL character. (The separator can't be the newline because the description metadata often contains newlines.)
Available metadata:
  title: string
  description: string
  image: string
  author: string
  creator: string
  site_name: string
  keywords: string

Examples

curl --silent https://slatestarcodex.com/2020/06/17/slightly-skew-systems-of-government/ | htmlmetadata

(title: "Slightly Skew Systems Of Government", description: "[Related To: Legal Systems Very Different From Ours Because I Just Made Them Up, List Of Fictional Drugs Banned By The FDA] I. Clamzoria is an acausal democracy. The problem with democracy is that …", image: "https://s0.wp.com/i/blank.jpg", author: "", creator: "", site_name: "Slate Star Codex", keywords: "")

curl --silent https://nintil.com/reversible-senescence | htmlmetadata

(title: "Nintil - Is cellular senescence irreversible?", description: "The internet's best blog!", image: "", author: "Jose Luis Ricon", creator: "", site_name: "", keywords: "economics, philosophy, technology, innovation, gdp growth, progress studies")

Note: cat -v is used in the following examples to make the NUL character visible (it is shown as ^@).

curl --silent https://nintil.com/reversible-senescence | htmlmetadata author | cat -v

Jose Luis Ricon

curl --silent https://nintil.com/reversible-senescence | htmlmetadata site_name author keywords | cat -v

^@Jose Luis Ricon^@economics, philosophy, technology, innovation, gdp growth, progress studies
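
The NUL-separated output can also be consumed from a script. The following is a minimal Python sketch, not part of htmlmetadata itself. The helper names parse_fields and extract are hypothetical. It assumes that the tool prints exactly one value per requested name, in the order requested (as the example output suggests), and that a single trailing newline, if present, is not part of the last value.

```python
import subprocess


def parse_fields(raw: bytes, names: list[str]) -> dict[str, str]:
    """Map NUL-separated output onto the requested metadata names."""
    text = raw.decode("utf-8", errors="replace")
    # Assumption: a single trailing newline, if any, is output framing
    # and not part of the last value.
    if text.endswith("\n"):
        text = text[:-1]
    # Split on NUL only; values such as the description may contain newlines.
    values = text.split("\0")
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return dict(zip(names, values))


def extract(html: bytes, names: list[str]) -> dict[str, str]:
    """Pipe HTML into htmlmetadata (assumed to be on PATH) and parse the result."""
    result = subprocess.run(
        ["htmlmetadata", *names],
        input=html,
        capture_output=True,
        check=True,
    )
    return parse_fields(result.stdout, names)
```

For instance, passing the bytes shown in the example above to parse_fields with the names site_name, author, and keywords would yield an empty site_name alongside the author and keywords values.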

Need to extract a metadata tag not covered by the API?

Check out the source! It's extremely easy to extend htmlmetadata. You can add support for a new tag by adding ~3 lines of code.

Similar projects

  • MetadataParser is a Python library for extracting HTML metadata. It's 10x slower than htmlmetadata, but it handles edge cases better.

License

Dual-licensed under GPLv3 (or any later version) and MIT.
