arm-pdf-scrape

Scrapes the instructions documented in the ARM manual

Usage (no npm install needed!)

<script type="module">
  import armPdfScrape from 'https://cdn.skypack.dev/arm-pdf-scrape';
</script>

npm module to scrape the assembly instructions from an ARM manual.

This is intended for the ARMv7-M Architecture Reference Manual. You must provide your own copy of the manual; this module is only a scraper.

Install

npm install --save arm-pdf-scrape

Usage

const {loadPdfFromPath, generateInstructions, instructionToText} = require("arm-pdf-scrape");

// Path to your own copy of the manual.
const filepath = "/path/to/manual.pdf";
loadPdfFromPath(filepath)
  .then(manual => generateInstructions(manual))
  .then(instructions => {
    // Print each scraped instruction as text.
    instructions.forEach(i => console.log(instructionToText(i)));
  })
  .catch(e => console.error(`Something went wrong: ${e}`));

Fluff

Scraping is imprecise, so we use expected values to guide it. For example:

  • The beginning of each entry has A7.7.[0-9]+ near the start of the page text.
  • The syntax follows "Assembler syntax" in bold font.
  • There will be "Encoding 1", etc., in bold font.
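
As a rough illustration of the first heuristic, the check below looks for an A7.7.N heading near the start of a page's text. This is only a sketch: findEntryStart, ENTRY_HEADING, and the 80-character window are hypothetical names and values chosen for the example, not part of the arm-pdf-scrape API.

```javascript
// Hypothetical helper, not part of arm-pdf-scrape.
// Matches an entry heading such as "A7.7.3" and captures the entry number.
const ENTRY_HEADING = /\bA7\.7\.(\d+)\b/;

// Returns the entry number if a heading appears within the first
// `window` characters of the page text, otherwise null.
// The window size is an assumed value for illustration.
function findEntryStart(pageText, window = 80) {
  const head = pageText.slice(0, window);
  const match = ENTRY_HEADING.exec(head);
  return match ? Number(match[1]) : null;
}
```

Limiting the search to the head of the page is meant to avoid matching cross-references to other entries that appear later in the body text.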

Steps:

  • Get text chunks of each page
  • Strip the runners (headers and footers)
  • Sort chunks and combine same-line items when possible
  • Extract regions of section-body
  • Merge all regions into one array
  • Separate regions into instructions
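
The runner-stripping and same-line merging steps above could look roughly like the sketch below. It assumes each text chunk has the hypothetical shape { text, x, y }, with y measured from the top of the page; the function names, the margin, and the tolerance are illustrative assumptions and may not match what the module does internally.

```javascript
// Hypothetical chunk shape: { text, x, y }, with y measured from the top of the page.

// Drop chunks that fall inside the top or bottom margin (headers and footers).
function stripRunners(chunks, pageHeight, margin = 50) {
  return chunks.filter(c => c.y > margin && c.y < pageHeight - margin);
}

// Sort chunks top to bottom, then join chunks whose y positions are
// within `tolerance` of the line's first chunk into a single line of text.
function combineLines(chunks, tolerance = 2) {
  const sorted = [...chunks].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const chunk of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - chunk.y) <= tolerance) {
      last.parts.push(chunk);
    } else {
      lines.push({ y: chunk.y, parts: [chunk] });
    }
  }
  // Within each line, order the parts left to right before joining.
  return lines.map(line =>
    line.parts.sort((a, b) => a.x - b.x).map(p => p.text).join(" ")
  );
}
```

The tolerance is there because chunks on the same visual line can report slightly different vertical positions, for example when the font changes mid-line.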

TODO:

  • Nested bullets in SSBB, PSSBB
  • Math in QADD
  • Spacing of bold, italic, verbatim