domgroup

Define a group of DOM elements from 2 or more examples.

Usage (no npm install needed):

<script type="module">
  import domgroup from 'https://cdn.skypack.dev/domgroup';
</script>

get it

$ npm i --save domgroup

use it

  <script type=module src=node_modules/domgroup/lib.js></script>

examples

The main method you care about is generalize.

It takes an array of CSS selectors and an optional array of "negative" CSS selectors (selectors to exclude), and returns an object holding a positive and a negative selector for the group.

For example:


  const myExamples = [
    'body > aside > div > a',
    'body > article > div > a',
    'body > header > div > a'
  ];

  const myNegativeExamples = [
    'body span a'
  ];

  const group = domgroup.generalize(myExamples, myNegativeExamples);

  // group is {positive: "body div a", negative: "body > span > a"}

  // Select everything the positive selector matches, then drop the negatives.
  const groupElements = Array.from(document.querySelectorAll(group.positive))
    .filter(el => !el.matches(group.negative));
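If you resolve groups in more than one place, the filtering step above can be wrapped in a small helper. This is only a sketch: `selectGroup` is a hypothetical name and not part of domgroup. It also assumes that `negative` may be missing or empty when no negative examples were supplied, which the README does not confirm. The guard is there because `matches` throws on an empty selector string.

```javascript
// Hypothetical helper, NOT part of domgroup's API.
// `root` is anything exposing querySelectorAll (document, an element, a fragment).
// `group` is the {positive, negative} object returned by generalize.
function selectGroup(root, group) {
  const candidates = Array.from(root.querySelectorAll(group.positive));

  // Assumption: `negative` can be absent or empty when no negative examples
  // were given. Skip the filter in that case instead of calling matches('').
  if (!group.negative) {
    return candidates;
  }

  // Keep only the candidates that the negative selector does not match.
  return candidates.filter(el => !el.matches(group.negative));
}
```

In a page you would call it as `selectGroup(document, group)`. Passing an element instead of `document` restricts the group to that subtree.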

what's the point?

If you build a scraping application and want to let users define a group of elements effortlessly, you need a way to merge the selectors of their examples so that the result keeps as much of the information in each selector as possible while still covering the whole group.

This is where domgroup comes in.

faq

how did you make this?

I applied a sequence alignment / longest common subsequence algorithm (the kind used in bioinformatics) to sequences of objects, specifically DOM-type objects that carry information such as IDs, classes and other characteristics of the DOM.
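To make the idea concrete, here is a minimal sketch of longest common subsequence applied to selector steps. It is not domgroup's implementation. It compares plain step strings rather than DOM-type objects with IDs and classes, it only understands the simple `a > b > c` form, and it folds a pairwise LCS over the examples, which is a simplification of true multi-sequence alignment. The names `toSteps`, `lcs` and `sketchGeneralize` are hypothetical.

```javascript
// Hypothetical helper: split a child-combinator selector into its steps.
// Only handles the simple 'a > b > c' form.
function toSteps(selector) {
  return selector.split('>').map(step => step.trim()).filter(Boolean);
}

// Classic dynamic-programming longest common subsequence of two arrays.
function lcs(a, b) {
  // table[i][j] holds the LCS length of a[i..] and b[j..].
  const table = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      table[i][j] = a[i] === b[j]
        ? table[i + 1][j + 1] + 1
        : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }

  // Walk the table from the start to recover one LCS.
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (table[i + 1][j] >= table[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// Fold the pairwise LCS over all examples (at least one is required).
// Steps are joined with the descendant combinator (a space) because
// dropped steps may sit between the ones that were kept.
function sketchGeneralize(selectors) {
  return selectors.map(toSteps).reduce((acc, steps) => lcs(acc, steps)).join(' ');
}

const merged = sketchGeneralize([
  'body > aside > div > a',
  'body > article > div > a',
  'body > header > div > a'
]);
console.log(merged); // "body div a"
```

The differing middle steps (`aside`, `article`, `header`) drop out, and the shared steps survive in order, which is how the three example selectors collapse to one selector covering all of them.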

can I use this in my own project?

Yes. You are free to use this in any project you like, commercial, non-profit or otherwise, under the terms of the MIT license.

what's the roadmap?

There are still some tasks to do, such as:

  • better support for any CSS selector
  • allow multiple sets of positive and negative examples
  • indicate if a negative example completely empties / intersects the group defined by the positives
  • consider more signals (such as attributes / dataset values) in the matching quotient