url-inspector

Get metadata about any URL.

Limited memory and network usage.

This is a node.js module.

It returns normalized information found in the HTTP headers or in the resource itself, using exiftool (which knows almost everything about files except HTML) or a SAX parser that reads oEmbed, Open Graph, Twitter Cards, schema.org attributes, and standard HTML tags.

Both tools stop inspecting once they have gathered enough tags, or once a maximum number of bytes (depending on media type) has been downloaded.

A demo using this module is available through url-inspector-daemon.

The inspection result is an object with the following properties:

  • url url of the inspected resource

  • title title of the resource, or filename, or last component of pathname with query

  • description optional longer description, without title in it, and only the first line.

  • site the name of the site, or the domain name

  • mime RFC 7231 mime type of the resource (defaults to Content-Type). The inspected mime type could be more accurate than the HTTP header.

  • ext the extension matching the mime type (not the file extension)

  • type what the resource represents: image, video, audio, link, file, embed, archive

  • html a canonical html representation of the full resource, depending on the type and mime, could be an image, anchor, video, audio, or iframe.

  • script url of a script to install along with the html representation. Breaking change: it used to be part of the html representation, but that required special handling of the html to make it work.

  • date (YYYY-MM-DD format) creation or modification date

  • author optional credit, author (without the @ prefix and with _ replaced by spaces)

  • keywords optional array of collected keywords (lowercased words that are not in title words).

  • size (number) optional Content-Length; discarded when type is embed

  • icon optional link to the favicon of the site

  • width, height (number) optional dimensions

  • duration (hh:mm:ss string) optional

  • thumbnail optional a URL to a thumbnail, could be a data-uri for embedded images

  • source optional a URL that can go in a 'src' attribute; for example a resource can be an html page representing an image type. The URL of the image itself would be stored here; same thing for audio, video, embed types.

  • error optional an http error code, or string

  • all an object with all additional metadata that was found
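As an illustration of how these properties might be consumed, here is a minimal sketch that converts the duration string to seconds and assembles the markup to insert in a page from html and script. The helper names (durationToSeconds, escapeAttr, renderResult) and the sample values are hypothetical and not part of url-inspector.

```javascript
// Convert an "hh:mm:ss" duration string to a number of seconds.
// Returns null when the string does not have three numeric parts.
function durationToSeconds(duration) {
  const parts = duration.split(':').map(Number);
  if (parts.length !== 3 || parts.some(Number.isNaN)) return null;
  const [h, m, s] = parts;
  return h * 3600 + m * 60 + s;
}

// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build the markup for a result: the canonical html, followed by a
// script tag when the result carries a script url.
function renderResult(meta) {
  if (meta.error) return '';
  let out = meta.html || '';
  if (meta.script) out += `<script src="${escapeAttr(meta.script)}"></script>`;
  return out;
}

// Hypothetical result object, with made-up values for illustration only.
const sample = {
  url: 'https://video.example/watch/1',
  type: 'video',
  html: '<iframe src="https://video.example/embed/1"></iframe>',
  script: 'https://video.example/player.js?a=1&b=2',
  duration: '00:03:25'
};

console.log(durationToSeconds(sample.duration)); // 205
console.log(renderResult(sample));
```

Keeping the script separate from the html lets the caller decide where and whether to load it.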

Installation

Besides npm i url-inspector, the following must be installed:

  • exiftool
  • libcurl (and libcurl-dev if node-libcurl needs to be rebuilt)

Both are well maintained and available in most Linux distributions.

API

const inspector = require('url-inspector');

// options and their defaults
const opts = {
 all: false, // return all available non-normalized metadata
 ua: "Mozilla/5.0", // some oembed providers might not answer otherwise
 nofavicon: false, // disable any favicon-related additional request
 nosource: false, // disable any sub-source inspection for audio, video, image types
 // new in version 2.3.0
 file: false // allow inspection of file:// URLs
};

// opts are optional; await must run inside an async function or an ES module

const obj = await inspector(url, opts);
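A minimal sketch of calling the inspector with basic error handling follows. The inspectSafely helper is hypothetical; it receives the inspector function as an argument so it does not depend on how the module is loaded. It assumes that a failure may show up either in the error property of the result or as a rejected promise, and handles both.

```javascript
// `inspect` is the function exported by url-inspector, passed in by the caller.
async function inspectSafely(inspect, url, opts = {}) {
  try {
    const obj = await inspect(url, opts);
    // Assumption: a failed inspection may be reported in obj.error.
    if (obj.error) return { ok: false, error: obj.error };
    return { ok: true, obj };
  } catch (err) {
    // Assumption: a failure may also surface as a rejected promise.
    return { ok: false, error: err.message };
  }
}

// Usage in a CommonJS script:
// const inspector = require('url-inspector');
// inspectSafely(inspector, 'https://example.com/', { nofavicon: true })
//   .then((result) => console.log(result));
```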

By default oembed providers are

  • found from a curated list of providers
  • discovered in the inspected web pages

It is possible to add custom providers in the options, by passing an array or a path to a module exporting an array:

opts.providers = [{
  provider_name: "Custom OEmbed provider",
  endpoints: [{
   schemes: ["http:\\/\\/video\\.com\\/*"],
   builder(urlObj, obj) {
    // can see current obj and override arbitrary props
    obj.embed = "custom embed url";
   },
   redirect(urlObj, ret) {
    // can change inspected url - use rewrite to make internal changes
    urlObj.path = "/another/path";
    return true;
   }
  }]
 }];

url-inspector uses node-libcurl to make http requests, and exposes it as:

const req = await inspector.get(urlObj);

where req.abort() stops the request, req.res is the response stream, and req.res.statusCode and req.res.headers are available.
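For example, the status and headers can be read and the request aborted before the body is downloaded. This is a sketch only: readStatus is a hypothetical helper, and it takes the get function as an argument, so that any function returning a request of the shape described above can be used.

```javascript
// `get` is expected to behave like inspector.get: it resolves to a request
// object exposing `res` (with statusCode and headers) and `abort()`.
async function readStatus(get, urlObj) {
  const req = await get(urlObj);
  const { statusCode, headers } = req.res;
  req.abort(); // the body is not needed, stop the transfer
  return { statusCode, headers };
}

// Usage (urlObj being whatever URL object the library expects):
// const inspector = require('url-inspector');
// readStatus((u) => inspector.get(u), urlObj).then(console.log);
```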

Command-line client

inspector-url <url>
inspector-url <filepath>

Some options are available through the CLI, such as --ua to test user agents.

Proxies

url-inspector configures HTTP(S) proxies through the proxy-from-env package and the environment variables http_proxy, https_proxy, all_proxy, and no_proxy.

See the proxy-from-env documentation for details.

Low resource usage

network:

  • at most a few hundred kilobytes (depending on resource type) are downloaded; the actual amount is usually much less and depends on connection speed.
  • inspection stops as soon as enough metadata is gathered
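The two stopping conditions above can be sketched with plain async iteration. This is not the module's actual implementation, only an illustration of the idea; readLimited, hasEnough, and fakeBody are hypothetical names.

```javascript
// Consume chunks from an async iterable until either `maxBytes` have been
// received or `hasEnough(chunk)` reports that enough data was gathered.
// Breaking out of the loop lets the source clean up (a Node.js readable
// stream is destroyed when its async iteration is exited early).
async function readLimited(source, maxBytes, hasEnough = () => false) {
  let received = 0;
  for await (const chunk of source) {
    received += chunk.length;
    if (hasEnough(chunk) || received >= maxBytes) break;
  }
  return received;
}

// Stand-in for a network response: four chunks of 10 bytes each.
async function* fakeBody() {
  for (let i = 0; i < 4; i++) yield Buffer.alloc(10);
}

// Stops after the third chunk, once the 25-byte limit is crossed.
readLimited(fakeBody(), 25).then((n) => console.log(n)); // 30
```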

memory: html is inspected using a sax parser, without building a full DOM.

exiftool: runs through the streat module, which keeps an exiftool process open for performance.

Since version 2.3.0, the file:// protocol is supported: it is enabled by default in the CLI, and can be enabled through the API by setting the "file" option to true (it is false by default).
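A minimal sketch of preparing such a call through the API, assuming the inspector accepts a file:// URL string in place of an http URL. The fileInspectionArgs helper is hypothetical; pathToFileURL comes from the Node.js url module.

```javascript
const { pathToFileURL } = require('node:url');

// Build the arguments for inspecting a local file: a file:// URL
// and options with the "file" flag explicitly enabled.
function fileInspectionArgs(filepath) {
  return [pathToFileURL(filepath).href, { file: true }];
}

// Usage:
// const inspector = require('url-inspector');
// const [url, opts] = fileInspectionArgs('/tmp/my file.pdf');
// inspector(url, opts).then(console.log);
```

Using pathToFileURL rather than string concatenation takes care of percent-encoding spaces and other special characters in the path.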

License

See LICENSE.

See also

https://github.com/kapouer/url-inspector-daemon

https://github.com/kapouer/node-streat