arm-pdf-scrape
npm module to scrape the assembly instructions from an ARM manual.
This is intended for the ARMv7-M Architecture Reference Manual. You must provide your own copy of the manual; this is just a scraper.
Install
npm install --save arm-pdf-scrape
Usage
const { loadPdfFromPath, generateInstructions, instructionToText } = require("arm-pdf-scrape");

const filepath = "/path/to/manual.pdf";
loadPdfFromPath(filepath)
  .then(manual => generateInstructions(manual))
  .then(instructions => {
    instructions.forEach(i => console.log(instructionToText(i)));
  })
  .catch(e => console.error(`Something went wrong: ${e}`));
Fluff
Scraping is imprecise, so we use expected values to guide it. E.g.,
- Each entry begins with A7.7.[0-9]+ near the start of the page text.
- The syntax follows "Assembler syntax" in bold font.
- There will be "Encoding 1", etc., in bold font.
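The first heuristic above can be sketched roughly as follows. This is an illustration only, not the module's actual code: the helper name `findEntryHeading` and the 200-character window used for "near the start" are assumptions made for the example.

```javascript
// Hypothetical helper, not part of arm-pdf-scrape's public API.
// Matches section numbers such as "A7.7.3" or "A7.7.120".
const ENTRY_HEADING = /A7\.7\.([0-9]+)/;

// Returns the section number if one appears near the start of the page
// text, otherwise null. The window size is an assumed value.
function findEntryHeading(pageText, window = 200) {
  const match = ENTRY_HEADING.exec(pageText.slice(0, window));
  return match ? { section: match[0], index: Number(match[1]) } : null;
}
```

A page that only mentions a section number deep in its body would be rejected, which is the point of restricting the search to the start of the text.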
Steps:
- Get text chunks of each page
- Strip the runners (headers and footers)
- Sort chunks and combine same-line items when possible
- Extract regions of section-body
- Merge all regions into one array
- Separate regions into instructions
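As a rough sketch of the second and third steps, assuming each text chunk is an object of the form `{ x, y, text }` with `y` growing down the page: the chunk shape, the function names `stripRunners` and `combineLines`, and the margin and tolerance values are all hypothetical and may not match what the module does internally.

```javascript
// Hypothetical chunk shape: { x, y, text }, with y increasing downward.

// Drop chunks that fall in the top or bottom margin (headers and footers).
function stripRunners(chunks, pageHeight, margin = 50) {
  return chunks.filter(c => c.y > margin && c.y < pageHeight - margin);
}

// Sort chunks top to bottom, then merge chunks whose y positions are
// within `tolerance` of the current line into a single line of text.
function combineLines(chunks, tolerance = 2) {
  const sorted = [...chunks].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const chunk of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - chunk.y) <= tolerance) {
      last.parts.push(chunk);
    } else {
      lines.push({ y: chunk.y, parts: [chunk] });
    }
  }
  // Within each line, order the parts left to right before joining.
  return lines.map(line =>
    line.parts.sort((a, b) => a.x - b.x).map(p => p.text).join(" ")
  );
}
```

The tolerance exists because chunks on the same visual line (for example bold and regular text) do not always share exactly the same baseline.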
TODO:
- Nested bullets in SSBB, PSSBB
- Math in QADD
- Spacing of bold, italic, verbatim