# :octocat: 🕷 🕸 GitHub Scraper
Learn how to parse the DOM of a web page by using your favourite coding community as an example.
## ⚠️ Disclaimer / Warning!
This repository/project is intended for **Educational Purposes ONLY**. The project and corresponding NPM module should not be used for any purpose other than learning. Please do not use it for any other reason than to learn about DOM parsing, and definitely don't depend on it for anything important!
The nature of DOM parsing is that when the HTML/UI changes, the parser will inevitably fail. GitHub have every right to change/improve their UI as they see fit, and when they do, the scraper will inevitably "break"! We have Travis-CI continuous integration to run our tests, precisely to check that the parsers for the various pages are working as expected. You can run the tests locally too; see the "Run The Tests" section below.
## Why?
Our initial reason for writing this set of scrapers was to satisfy the curiosity / question:
> How can we discover which are the interesting people and projects on GitHub (without manually checking dozens of GitHub profiles/repositories each day)?
Our second reason for scraping data from GitHub is so that we can show people a "summary view" of all their issues in our Tudo project (which helps people track/manage/organise/prioritise their GitHub issues). See: https://github.com/dwyl/tudo/issues/51
We needed a simple way of systematically getting data from GitHub (before people authenticate) and scraping is the only way we could think of.
We tried using the GitHub API to get records from GitHub, but sadly it has quite a few limitations (see: "Issues with GitHub API" section below), the biggest being the rate-limiting on API requests.
Thirdly, we're building this project to scratch our own itch ... scraping the pages of GitHub has given us a unique insight into the features of the platform, which has leveled-up our skills.
Don't you want to know what's "Hot" right now on GitHub...?
## What (Problem are we trying to Solve)?
Having a way of extracting the essential data from GitHub is a solution to a surprisingly wide array of problems, here are a few:
- Who are the up-and-coming people (worth following) on GitHub?
- Which are the interesting projects (and why?!)
- What is the average age of an issue for a project?
- Is a project's popularity growing or plateaued?
- Are there (already) any similar projects to what I'm trying to build? (reduce duplication of effort which is rampant in Open Source!!)
- How many projects get started but never finished?
- Will my Pull Request ever get merged or is the module maintainer too busy and did I just waste 3 hours?
- insert your idea/problem here ...
- Associative lists, e.g: people who starred `abc` also liked `xyz`
## How?
This module fetches (public) pages from GitHub, "scrapes" the HTML to extract raw data, and returns a JSON object.
## Usage
### Install from NPM

Install from npm and save to your `package.json`:

```sh
npm install github-scraper --save
```
### Use it in your script!
```js
var gs = require('github-scraper');
var url = '/iteles'; // a random username
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
## Example URLs and Output
### Profile Page
A user profile has the following format: `https://github.com/{username}`, example: https://github.com/iteles
```js
var gs = require('github-scraper'); // require the module
var url = 'iteles'; // a random username (of someone you should follow!)
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
Sample output:
```js
{
  "type": "profile",
  "url": "/iteles",
  "avatar": "https://avatars1.githubusercontent.com/u/4185328?s=400&v=4",
  "name": "Ines Teles Correia",
  "username": "iteles",
  "bio": "Co-founder @dwyl | Head cheerleader @foundersandcoders",
  "uid": 4185328,
  "worksfor": "@dwyl",
  "location": "London, UK",
  "website": "http://www.twitter.com/iteles",
  "orgs": {
    "bowlingjs": "https://avatars3.githubusercontent.com/u/8825909?s=70&v=4",
    "foundersandcoders": "https://avatars3.githubusercontent.com/u/9970257?s=70&v=4",
    "docdis": "https://avatars0.githubusercontent.com/u/10836426?s=70&v=4",
    "dwyl": "https://avatars2.githubusercontent.com/u/11708465?s=70&v=4",
    "ladiesofcode": "https://avatars0.githubusercontent.com/u/16606192?s=70&v=4",
    "TheScienceMuseum": "https://avatars0.githubusercontent.com/u/16609662?s=70&v=4",
    "SafeLives": "https://avatars2.githubusercontent.com/u/20841400?s=70&v=4"
  },
  "repos": 28,
  "projects": 0,
  "stars": 453,
  "followers": 341,
  "following": 75,
  "pinned": [
    { "url": "/dwyl/start-here" },
    { "url": "/dwyl/learn-tdd" },
    { "url": "/dwyl/learn-elm-architecture-in-javascript" },
    { "url": "/dwyl/tachyons-bootstrap" },
    { "url": "/dwyl/learn-ab-and-multivariate-testing" },
    { "url": "/dwyl/learn-elixir" }
  ],
  "contribs": 878,
  "contrib_matrix": {
    "2018-04-08": { "fill": "#c6e48b", "count": 1, "x": "13", "y": "0" },
    "2018-04-09": { "fill": "#c6e48b", "count": 2, "x": "13", "y": "12" },
    "2018-04-10": { "fill": "#7bc96f", "count": 3, "x": "13", "y": "24" },
    ...etc...
    "2019-04-11": { "fill": "#c6e48b", "count": 1, "x": "-39", "y": "48" },
    "2019-04-12": { "fill": "#7bc96f", "count": 5, "x": "-39", "y": "60" }
  }
}
```
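
Since the result is a plain JavaScript object, you can post-process it however you like. For example, here is a minimal (hypothetical) sketch that re-derives the yearly contribution total by summing the per-day counts in `contrib_matrix`:

```js
var gs = require('github-scraper');

gs('/iteles', function (err, data) {
  if (err) { return console.error(err); }
  // Sum the per-day "count" values in contrib_matrix; this should
  // roughly match the "contribs" total shown on the profile.
  var total = Object.keys(data.contrib_matrix).reduce(function (sum, day) {
    return sum + data.contrib_matrix[day].count;
  }, 0);
  console.log('contributions in the last year:', total);
});
```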
### Followers
How many people are following a given person on GitHub?
URL format: `https://github.com/{username}/followers`, example: https://github.com/iteles/followers
```js
var gs = require('github-scraper'); // require the module
var url = 'iteles/followers';
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
Sample output:
```js
{ entries:
   [ 'tunnckoCore', 'OguzhanE', 'minaorangina', 'Jasonspd', 'muntasirsyed', 'fmoliveira', 'nofootnotes',
     'SimonLab', 'Danwhy', 'kbocz', 'cusspvz', 'RabeaGleissner', 'beejhuff', 'heron2014', 'joshpitzalis',
     'rub1e', 'nikhilaravi', 'msmichellegar', 'anthonybrown', 'miglen', 'shterev', 'NataliaLKB',
     'ricardofbarros', 'boymanjor', 'asimjaved', 'amilvasishtha', 'Subhan786', 'Neats29', 'lottie-em',
     'rorysedgwick', 'izaakrogan', 'oluoluoxenfree', 'markwilliamfirth', 'bmordan', 'nodeco', 'besarthoxhaj',
     'FilWisher', 'maryams', 'sofer', 'joaquimserafim', 'vs4vijay', 'intool', 'edwardcodes', 'hyprstack',
     'nelsonic' ],
  url: 'https://github.com/iteles/followers' }

ok 1 iteles/followers count: 45
```
If the person has more than 51 followers, they will have multiple pages of followers and the data will have a `next_page` key with a value such as: `/nelsonic/followers?page=2`. If you want to keep fetching these subsequent pages of followers, simply keep running the scraper, e.g:
```js
var url = 'alanshaw/followers'; // a random username (of someone you should follow!)
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
  if (data.next_page) {
    gs(data.next_page, function (err2, data2) {
      console.log(data2); // etc.
    });
  }
});
```
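
To walk every page rather than just the next one, a small recursive helper does the job. This is a sketch, not part of the module's API; `collectFollowers` is a hypothetical name:

```js
var gs = require('github-scraper');

// Hypothetical helper: follow next_page links until they run out,
// accumulating every follower username into one array.
function collectFollowers(url, acc, callback) {
  gs(url, function (err, data) {
    if (err) { return callback(err); }
    acc = acc.concat(data.entries);
    if (data.next_page) {
      return collectFollowers(data.next_page, acc, callback);
    }
    callback(null, acc);
  });
}

collectFollowers('iteles/followers', [], function (err, followers) {
  if (err) { return console.error(err); }
  console.log(followers.length + ' followers in total');
});
```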
### Following
Want to know the list of people this person is following? That's easy too! The URL format is: `https://github.com/{username}/following`, e.g: https://github.com/iteles/following or https://github.com/nelsonic/following?page=2 (where the person is following more than 51 people...). Usage is identical to followers (above), so here's an example of fetching page 3 of the results:
```js
var gs = require('github-scraper'); // require the module
var url = 'nelsonic/following?page=3'; // a random dude
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
Sample output:
```js
{
  entries:
   [ 'kytwb', 'dexda', 'arrival', 'jinnjuice', 'slattery', 'unixarcade', 'a-c-m', 'krosti',
     'simonmcmanus', 'jupiter', 'capaj', 'cowenld', 'FilWisher', 'tsop14', 'NataliaLKB',
     'izaakrogan', 'lynnaloo', 'nvcexploder', 'cwaring', 'missinglink', 'alanshaw', 'olizilla',
     'tancredi', 'Ericat', 'pgte', 'hyprstack', 'iteles' ],
  url: 'https://github.com/nelsonic/following?page=3',
  next_page: 'https://github.com/nelsonic/following?page=4'
}
```
### Starred Repositories
The list of projects a person has starred is a fascinating source of insight. URL format: `https://github.com/stars/{username}`, e.g: /stars/iteles
```js
var gs = require('github-scraper'); // require the module
var url = 'stars/iteles'; // starred repos for this user
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
Sample output:
```js
{
  entries:
   [ '/dwyl/repo-badges', '/nelsonic/learn-testling', '/joshpitzalis/testing', '/gmarena/gmarena.github.io',
     '/dwyl/alc', '/nikhilaravi/fac5-frontend', '/foundersandcoders/dossier', '/nelsonic/health', '/dwyl/alvo',
     '/marmelab/gremlins.js', '/docdis/learn-saucelabs', '/rogerdudler/git-guide', '/tableflip/guvnor',
     '/dwyl/learn-redis', '/foundersandcoders/playbook', '/MIJOTHY/FOR_FLUX_SAKE', '/NataliaLKB/learn-git-basics',
     '/nelsonic/liso', '/dwyl/learn-json-web-tokens', '/dwyl/hapi-auth-jwt2', '/dwyl/start-here',
     '/arvida/emoji-cheat-sheet.com', '/dwyl/time', '/docdis/learn-react', '/dwyl/esta', '/alanshaw/meteor-foam',
     '/alanshaw/stylist', '/meteor-velocity/velocity', '/0nn0/terminal-mac-cheatsheet',
     '/bowlingjs/bowlingjs.github.io' ],
  url: 'https://github.com/stars/iteles?direction=desc&page=2&sort=created',
  next_page: 'https://github.com/stars/iteles?direction=desc&page=3&sort=created'
}
```
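
Each entry is an `/owner/repo` path, so it's easy to post-process. For example, a hypothetical snippet to tally which owners a person stars most often:

```js
var gs = require('github-scraper');

gs('stars/iteles', function (err, data) {
  if (err) { return console.error(err); }
  var counts = {};
  data.entries.forEach(function (path) {
    var owner = path.split('/')[1]; // '/dwyl/repo-badges' -> 'dwyl'
    counts[owner] = (counts[owner] || 0) + 1;
  });
  console.log(counts); // e.g: { dwyl: 8, nelsonic: 3, ... }
});
```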
### Repositories
The second tab on the personal profile page is "Repositories"; this is a list of the personal projects the person is working on, e.g: https://github.com/iteles?tab=repositories
We crawl this page and return an array containing the repo properties:
```js
var gs = require('github-scraper'); // require the module
var url = 'iteles?tab=repositories';
gs(url, function (err, data) {
  console.log(data); // or whatever you want to do with the data
});
```
Sample output:
```js
{
  entries: [
    { url: '/iteles/learn-ab-and-multivariate-testing',
      name: 'learn-ab-and-multivariate-testing',
      lang: '',
      desc: 'Tutorial on A/B and multivariate testing',
      info: '',
      stars: '4',
      forks: '0',
      updated: '2015-07-08T08:36:37Z' },
    { url: '/iteles/learn-tdd',
      name: 'learn-tdd',
      lang: 'JavaScript',
      desc: 'A brief introduction to Test Driven Development (TDD) in JavaScript',
      info: 'forked from dwyl/learn-tdd',
      stars: '0',
      forks: '4',
      updated: '2015-06-29T17:24:56Z' },
    { url: '/iteles/practical-full-stack-testing',
      name: 'practical-full-stack-testing',
      lang: 'HTML',
      desc: 'A fork of @nelsonic\'s repo to allow for PRs',
      info: 'forked from nelsonic/practical-js-tdd',
      stars: '0',
      forks: '36',
      updated: '2015-06-06T14:40:43Z' },
    { url: '/iteles/styling-for-accessibility',
      name: 'styling-for-accessibility',
      lang: '',
      desc: 'A collection of \'do\'s and \'don\'t\'s of CSS to ensure accessibility',
      info: '',
      stars: '0',
      forks: '0',
      updated: '2015-05-26T11:06:28Z' },
    { url: '/iteles/Ultimate-guide-to-successful-meetups',
      name: 'Ultimate-guide-to-successful-meetups',
      lang: '',
      desc: 'The ultimate guide to organizing successful meetups',
      info: '',
      stars: '3',
      forks: '0',
      updated: '2015-05-19T09:40:39Z' },
    { url: '/iteles/Javascript-the-Good-Parts-notes',
      name: 'Javascript-the-Good-Parts-notes',
      lang: '',
      desc: 'Notes on the seminal "Javascript the Good Parts" by Douglas Crockford',
      info: '',
      stars: '41',
      forks: '12',
      updated: '2015-05-17T16:39:35Z' }
  ],
  url: 'https://github.com/iteles?tab=repositories'
}
```
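
The `entries` array is ordinary JSON, so standard array methods apply. As a hypothetical example, sorting a person's repos by star count (note that `stars` and `forks` are scraped as strings, so coerce them to numbers first):

```js
var gs = require('github-scraper');

gs('iteles?tab=repositories', function (err, data) {
  if (err) { return console.error(err); }
  // stars is a string like '41', so convert before comparing
  var byStars = data.entries.slice().sort(function (a, b) {
    return Number(b.stars) - Number(a.stars);
  });
  console.log('Most starred: ' + byStars[0].name + ' (' + byStars[0].stars + ' stars)');
});
```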
### Activity feed
Every person on GitHub has an RSS feed for their recent activity; this is the 3rd and final tab of the person's profile page. It can be viewed online by visiting: `https://github.com/{username}?tab=activity`, e.g: /iteles?tab=activity
#### Parsing the Feed
The activity feed is published as an `.atom` XML string which contains a list of entries. We use xml2js (which in turn uses the sax XML parser) to parse the XML stream into a plain JavaScript object.
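
For illustration, here is a minimal sketch of that approach: fetching a user's public `.atom` feed and parsing it with xml2js. The feed URL and the `result.feed.entry` shape are assumptions based on GitHub's public atom feeds, not the module's internals:

```js
var https = require('https');
var xml2js = require('xml2js');

// Fetch the public activity feed for a user (assumed URL format)
https.get('https://github.com/iteles.atom', function (res) {
  var xml = '';
  res.on('data', function (chunk) { xml += chunk; });
  res.on('end', function () {
    // parseString is the standard xml2js entry point
    xml2js.parseString(xml, function (err, result) {
      if (err) { return console.error(err); }
      // result.feed.entry is (assumed to be) an array of activity entries
      console.log(result.feed.entry.length + ' recent activity entries');
    });
  });
});
```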