@celebi/cheerio-extract

Extract information from HTML with a rule string.

Usage

const HTML = `
<div>
  <a id="a" class="a" href="//baidu.com" data-id="link">百度</a>

  <dl>
    <dt>文本</dt>
    <dd><p>内容</p></dd>
  </dl>

  <section>
    前内容<u>下划线</u>后内容
  </section>

  <ul>
    <li data-index="1">
      <a href="//a.com/id1/101">(1)</a>
    </li>
    <li data-index="2">
      <a href="//a.com/id1/102">(2)</a>
    </li>
    <li data-index="3">
      <a href="//a.com/id1/103">(3)</a>
    </li>
    <li data-index="4">
      <a href="//a.com/id1/104">(4)</a>
    </li>
  </ul>
</div>
`
const CheerioExtract = require('@celebi/cheerio-extract').default; // CommonJS
// import CheerioExtract from '@celebi/cheerio-extract';  // ES module

const ce = new CheerioExtract(HTML);

// Extract data with a rule string (see the examples below)
ce.query();
// Register a custom filter function
ce.useFilter();

Grammar

  • Attribute: :AttributeName, e.g. :href
  • Remove a DOM element: -Tag, e.g. -p removes the p tag
  • Filter: | functionName(parameter1, ..., parameterN), e.g. | prefix(a, b)
  • Filter with wrapped parameters: | functionName((parameter1), ..., (parameterN)), where each parameter is wrapped in ()
  • Filter without parameters: | functionName
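To make the grammar above concrete, here is a minimal sketch of how a rule string could be broken into its parts. The names splitTopLevel and parseRule are hypothetical and are not part of @celebi/cheerio-extract; the library's real parser may work differently. The sketch only handles rules where the selector, attribute, and removals come before the first filter.

```javascript
// Hypothetical helper: split on a separator, ignoring separators that sit
// inside parentheses (so "array(| text | trim)" stays in one piece).
function splitTopLevel(input, separator) {
  const parts = [];
  let depth = 0;
  let current = '';
  for (const ch of input) {
    if (ch === '(') depth += 1;
    if (ch === ')') depth -= 1;
    if (ch === separator && depth === 0) {
      parts.push(current.trim());
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current.trim());
  return parts;
}

// Hypothetical parser: selector tokens, one :Attribute, any -Tag removals,
// then the chain of filters after each top-level "|".
function parseRule(rule) {
  const [head, ...filters] = splitTopLevel(rule, '|');
  const selector = [];
  const remove = [];
  let attribute = null;
  for (const token of head.split(/\s+/).filter(Boolean)) {
    if (token.startsWith(':')) attribute = token.slice(1);       // :href -> href
    else if (token.startsWith('-')) remove.push(token.slice(1)); // -u -> u
    else selector.push(token);                                   // CSS selector part
  }
  return { selector: selector.join(' '), attribute, remove, filters };
}

console.log(parseRule('.a :href | prefix(http:) | suffix(?q=123)'));
console.log(parseRule('section -u | text | trim'));
```

Tracking parenthesis depth is what lets nested rules keep their own | characters, which is presumably why the grammar allows parameters to be wrapped in ().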

attribute

ce.query('.a :href')               // output -> //baidu.com
ce.query('.a :data-id')            // output -> link

html or text

ce.query('.a | text')              // output -> 百度
ce.query('dd | html')              // output -> <p>内容</p>

delete dom

ce.query('section -u | text | trim');        // output -> 前内容后内容

add prefix or suffix

ce.query('.a :href | prefix(http:)')                     // output -> http://baidu.com
ce.query('.a :href | suffix(?q=123)')                    // output -> //baidu.com?q=123
ce.query('.a :href | prefix(http:) | suffix(?q=123)')     // output -> http://baidu.com?q=123
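The chained example above reads left to right: each filter receives the output of the previous one. A minimal plain-JavaScript sketch of that idea follows. The functions prefix, suffix, and pipe here are illustrative stand-ins written for this example, not the library's own code.

```javascript
// Illustrative stand-ins for the two filters (assumed behaviour:
// plain string concatenation before or after the value).
const prefix = (text) => (value) => text + value;
const suffix = (text) => (value) => value + text;

// Apply filters in order, the way a rule's "|" chain is read.
const pipe = (value, ...filters) => filters.reduce((acc, fn) => fn(acc), value);

const href = '//baidu.com';
console.log(pipe(href, prefix('http:')));                   // http://baidu.com
console.log(pipe(href, suffix('?q=123')));                  // //baidu.com?q=123
console.log(pipe(href, prefix('http:'), suffix('?q=123'))); // http://baidu.com?q=123
```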

eq

ce.query('ul li:eq(2) :data-index')        // output -> 3
ce.query('ul li | eq(2) :data-index')      // output -> 3

filter text

ce.query('ul li:eq(2) a :href | filter(/, 1)')     // output -> a.comid03
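The output above is consistent with filter removing every occurrence of each of its arguments ("/" and "1") from the value. That reading is inferred from this one example and is not confirmed by the package. The sketch below reproduces the result in plain JavaScript; stripAll is a hypothetical name.

```javascript
// Assumed behaviour: delete every occurrence of each listed string.
function stripAll(value, ...needles) {
  return needles.reduce((acc, needle) => acc.split(needle).join(''), value);
}

// href of the third list item in the sample HTML
console.log(stripAll('//a.com/id1/103', '/', '1')); // a.comid03
```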

list

grammar

  • Get one: | array(Rule)
  • Two-dimensional array: | array((Rule1), (Rule2))
  • Object data: | array(Key1 => (Rule1), Key2 => (Rule2))

Get one

ce.query('ul li | array(| text | trim)')

// output ->
  [
    '(1)',
    '(2)',
    '(3)',
    '(4)'
  ]

Two-dimensional array. Note: it is best to wrap each parameter in ().

ce.query('ul li | array((:data-index), (a | text))')

// output ->
  [
    ['1', '(1)'],
    ['2', '(2)'],
    ['3', '(3)'],
    ['4', '(4)']
  ];

Object data. The part before => is the key and the part after => is the rule. It is best to wrap the rule in ().

ce.query('ul a | array(href => (:href | prefix(https:)), title => (| text))')

// output ->
  [
    { href: 'https://a.com/id1/101', title: '(1)' },
    { href: 'https://a.com/id1/102', title: '(2)' },
    { href: 'https://a.com/id1/103', title: '(3)' },
    { href: 'https://a.com/id1/104', title: '(4)' }
  ];
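The advice to wrap rules in () makes sense once the arguments of array(...) have to be separated: a nested rule can itself contain commas and | characters. Below is a minimal sketch of how the object form could be split into key and rule pairs. The names splitArgs, unwrap, and parseObjectArgs are hypothetical and the library may parse its arguments differently.

```javascript
// Hypothetical helper: split on commas that are not inside parentheses.
function splitArgs(input) {
  const parts = [];
  let depth = 0;
  let current = '';
  for (const ch of input) {
    if (ch === '(') depth += 1;
    if (ch === ')') depth -= 1;
    if (ch === ',' && depth === 0) {
      parts.push(current.trim());
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current.trim());
  return parts;
}

// Drop one pair of wrapping parentheses, if the string has them.
const unwrap = (s) => (s.startsWith('(') && s.endsWith(')') ? s.slice(1, -1).trim() : s);

// Turn "key => (rule), key => (rule)" into [key, rule] pairs.
function parseObjectArgs(args) {
  return splitArgs(args).map((part) => {
    const at = part.indexOf('=>');
    return [part.slice(0, at).trim(), unwrap(part.slice(at + 2).trim())];
  });
}

const pairs = parseObjectArgs('href => (:href | prefix(https:)), title => (| text)');
console.log(pairs);
// -> [['href', ':href | prefix(https:)'], ['title', '| text']]
```

Each resulting rule would then be evaluated once per matched element, which yields one object per list item as in the output above.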