
README Crawler (npm package)


A Node.js web crawler that downloads README files and recursively follows the GitHub repository links they contain.

It fetches the default README file displayed at a GitHub repository URL.

Installation

npm install --save readme-crawler

Usage

Create a new crawler instance and pass in a configuration object. Call the run method to download the README at the given URL.

  import ReadMeCrawler from 'readme-crawler';

  const crawler = new ReadMeCrawler({
    startUrl: 'https://github.com/jnv/lists',
    followReadMeLinks: true,
    outputFolderPath: './output/'
  });

  // -> fetch https://github.com/jnv/lists
  // -> download README in project root directory
  // -> export to new folder in root/output/repositories
  // -> generate list of other repository links
  // -> repeat steps on each link
  crawler.run();

Configuration Properties

Name               Type     Description
startUrl           string   GitHub repository URL, formatted 'https://github.com/user/repo'
followReadMeLinks  boolean  Recursively follow README links and export data at each repo
outputFolderPath   string   Folder for README downloads, relative to the project root
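
For example, a non-recursive crawl that downloads only the starting repository's README could be configured like this (a minimal sketch reusing the options shown above; the only change is followReadMeLinks):

  import ReadMeCrawler from 'readme-crawler';

  // One-shot crawl: download the README at startUrl and
  // skip the recursive link-following step.
  const oneShot = new ReadMeCrawler({
    startUrl: 'https://github.com/jnv/lists',
    followReadMeLinks: false,      // do not queue linked repositories
    outputFolderPath: './output/'  // written relative to the project root
  });

  oneShot.run();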

Crawler Error

Issue: each repository link is written to a file named linkQueue.txt. Because this file is written asynchronously while the crawler is running, the crawler may try to read it before the write has finished.

Solution: restart the crawler by calling crawler.run() again. The link queue already contains the discovered links; the earlier failure only means the crawler read the file before it was fully written.
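
A defensive workaround is to wrap the call in a retry. This is a hedged sketch: it assumes run() throws when linkQueue.txt is read too early, which may not match the library's actual async behavior:

  import ReadMeCrawler from 'readme-crawler';

  const crawler = new ReadMeCrawler({
    startUrl: 'https://github.com/jnv/lists',
    followReadMeLinks: true,
    outputFolderPath: './output/'
  });

  // Assumption: run() throws if linkQueue.txt is still being written.
  // On failure, run again; the queue file now holds the discovered links.
  try {
    crawler.run();
  } catch (err) {
    console.error('First run failed, retrying from linkQueue.txt:', err);
    crawler.run();
  }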


spencerlepine.com  ·  GitHub @spencerlepine  ·  Twitter @spencerlepine