README
Archivator
Ever wanted to archive your own copy of articles you enjoyed reading and to be able to search through them?
CURRENT STATUS: This is the frozen v1.x branch. Future work happens on the v3.x-dev branch, but this branch is usable as-is; see renoirb/archivator-demo
Summary
This project is a means to try out ECMAScript 2017 tooling and do something useful. See Challenge below.
The objective of this project is to:
(Note: a check mark :white_check_mark: below denotes work that has been done and should be usable)
- :white_check_mark: Cache HTML payload of source Web Pages URLs we want archived (see src/fetcher.js)
- :white_check_mark: Store files for each source URL at a consistent path name (see src/normalizer/slugs.js) (see v3.x-dev url-dirname-normalizer)
- :white_check_mark: Extract assets, download them for archiving purposes (see src/transformer.js at extractAssets and src/normalizer/assets.js) (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Download images ("assets") from Web Pages (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Rename assets in archive and adjust archived version to use cached copies (see src/normalizer/hash.js and src/transformer.js at reworkAssetReference) (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Do not download tracking images and/or ignore inline base64 images
- Read the link list from different kinds of sources
  - RSS XML document
  - :white_check_mark: CSV file (defaults to archive/index.csv)
- :white_check_mark: Extract the main content for each article (see src/transformer.js at extractAssets) (see v3.x-dev @archivator/archivable)
- :white_check_mark: Export into simplified excerpt document (see src/transformer.js at markdownify) (see v3.x-dev @archivator/content-divinator)
- Add documents into a search index
- Make a stand-alone bundle using Rollup
- :white_check_mark: (incomplete) Make it usable as an external module (see renoirb/archivator-demo)
- :white_check_mark: Make it an NPM package
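To illustrate what "a consistent path name" for each source URL could look like, here is a minimal sketch. The function name urlToDirname and its normalization rules are assumptions made for this example; this is not the code in src/normalizer/slugs.js, only one way to get paths shaped like the ones shown in the terminal output further down.

```javascript
'use strict';

// Hypothetical helper (an assumption for illustration, not the project's
// actual normalizer): turn a source URL into a relative directory path.
function urlToDirname(sourceUrl) {
  // URL is a global in modern Node.js; it lowercases the hostname for us.
  const { hostname, pathname } = new URL(sourceUrl);

  const segments = pathname
    .split('/')
    // Drop empty segments, which also removes leading and trailing slashes.
    .filter(Boolean)
    // Keep only characters that are safe in directory names.
    .map(segment =>
      segment
        .toLowerCase()
        .replace(/[^a-z0-9._-]+/g, '-')
        .replace(/^-+|-+$/g, '')
    )
    .filter(Boolean);

  return [hostname, ...segments].join('/');
}

console.log(
  urlToDirname('https://renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy/')
);
// prints "renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy"
```

The query string and fragment are ignored in this sketch, so two URLs that differ only by query parameters would map to the same directory; a real normalizer may need to account for that.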
Use
Install production only dependencies.
This assumes you have dist/ compiled (see Build below) and have deleted node_modules/.
npm install --only=production
Edit example.js and add more URLs (if you want)
node example.js
Run through Babel
yarn install
Create a folder archive/ and add an index file that we'll use to read and fetch pages from.
The file is CSV, using a semicolon (;) as the separator. The fields are:
- URL to read from
- CSS selector to main part of the content you want to keep
- One or many CSS selectors (i.e. comma-separated, as CSS already supports) of elements you want removed from the archive (e.g. ads)
// file archive/index.csv
https://renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy/;article;
https://renoirboulanger.com/blog/2015/05/add-openstack-instance-meta-data-info-salt-grains/;article;
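As a rough sketch of how such an index file could be read, the snippet below splits each line on the semicolon into the three fields described above. The function name parseIndex and the field names url, selector and truncate are assumptions for this example, and skipping lines that start with // is also an assumption; the project's own reader may behave differently.

```javascript
'use strict';

// Hypothetical parser (illustration only): one record per non-empty line.
function parseIndex(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    // Assumption: blank lines and "//" comment lines carry no record.
    .filter(line => line !== '' && !line.startsWith('//'))
    .map(line => {
      // Missing trailing fields default to an empty string.
      const [url, selector = '', truncate = ''] = line.split(';');
      return {
        url: url.trim(),
        selector: selector.trim(), // CSS selector of the main content
        truncate: truncate.trim() // comma-separated selectors to remove
      };
    });
}

const sample = [
  'https://renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy/;article;',
  'https://renoirboulanger.com/blog/2015/05/add-openstack-instance-meta-data-info-salt-grains/;article;'
].join('\n');

for (const row of parseIndex(sample)) {
  console.log(row.url, '->', row.selector);
}
```

A naive split like this breaks if a URL itself contains a semicolon, so a real implementation would need quoting or a proper CSV parser for such cases.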
Run fetcher
npm start
You should see the following in the terminal output
...
Archived renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy
Archived renoirboulanger.com/blog/2015/05/add-openstack-instance-meta-data-info-salt-grains
And you should see a few files getting created:
- cache.html: the raw HTML file downloaded from the origin
- cache.json: a JSON cache of metadata gathered during the process
- index.md: the simplified article converted to Markdown
- Files named with letters and numbers are images found in the document
archive/
`-renoirboulanger.com/
  `-blog/
    `-2015/
      `-05/
        `-add-openstack-instance-meta-data-info-salt-grains/
          |- cache.html
          |- cache.json
          |- 5e6327f278a336349f8bb6b26163dabedb173bcd.png
          |- 881811befc2fa6ad9c8ec058e1be3bd231fdcc1f.png
          |- b69a780dc3278f5d86296d2f219821eeac385f20.jpg
          |- c0e21ae7f0a56374116f08b44087d07ab8710035.png
          |- c3d25fac5b0c573275b15822294e484097edd945
          |- cd5f2a6cfa00a45e755b07013e59cb7c03bb9826.jpg
          |- eb31cca43b832b0016a2211e6e0058b263f4a1c0.png
          |- f6c4338884f46d3942589fcc29611fa68b600bad.png
          |- index.md
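The image file names above are 40 hexadecimal characters, optionally followed by the original extension. One way to produce names of that shape is to hash the asset URL with SHA-1, as sketched below. That the project uses SHA-1 of the URL is an assumption made for this example, and the helper names assetFileName and isInlineImage are hypothetical; see src/normalizer/hash.js for what the project actually does.

```javascript
'use strict';

const crypto = require('crypto');
const path = require('path');

// Hypothetical helper: derive a stable file name from an asset URL.
// Assumption: SHA-1 of the full URL, plus the extension found in the path.
function assetFileName(assetUrl) {
  const { pathname } = new URL(assetUrl);
  const hash = crypto.createHash('sha1').update(assetUrl).digest('hex');
  // extname returns '' when the path has no extension, which would explain
  // a name with no suffix such as the one listed in the tree above.
  const extension = path.posix.extname(pathname).toLowerCase();
  return hash + extension;
}

// Hypothetical helper: inline base64 images use the data: scheme and have
// nothing to download.
function isInlineImage(src) {
  return /^data:/i.test(src);
}

console.log(assetFileName('https://example.org/images/logo.PNG'));
console.log(isInlineImage('data:image/png;base64,iVBORw0KGgo='));
```

Because the name depends only on the URL, the same asset referenced from the same address always maps to the same file, which makes rewriting references in the archived copy straightforward.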
Run tests
npm test
Run xo (coding convention linter)
npm run lint
Build
IMPORTANT: This is no longer supported and is broken; see the note in dist/README.md
Run in Node.js, as ECMAScript 5 transpiled code.
yarn install
npm run build
This should do the same as running npm start with modern Node.js v6+ with Babel
node dist/cli.js
Challenge
Make an archiving system while learning how to use bleeding edge JavaScript.
- Use ECMAScript 2017's async/await along with generators (function * (){ /* ... */ yield 'something'; })
- Figure out how to export into ES5
- Figure out how to package, test and so on
- As few dependencies as possible for development
- (Ideally) No dependencies to run once bundled
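As a small illustration of the first point in the list above, the sketch below combines a generator that yields URLs with an async function that awaits each fetch in turn. The names readUrls, fakeFetch and archiveAll are made up for this example, and fakeFetch is a stand-in that only simulates a download; none of this is taken from the project's source.

```javascript
'use strict';

// Generator: lazily yields the URL (first field) of each non-empty line.
function * readUrls(lines) {
  for (const line of lines) {
    const url = line.split(';')[0].trim();
    if (url) {
      yield url;
    }
  }
}

// Stand-in for a real HTTP request (assumption: no network in this sketch).
function fakeFetch(url) {
  return new Promise(resolve => {
    setTimeout(() => resolve(`<html><!-- ${url} --></html>`), 10);
  });
}

// Async function: awaits one page at a time while iterating the generator.
async function archiveAll(lines, fetchPage = fakeFetch) {
  const archived = [];
  for (const url of readUrls(lines)) {
    const html = await fetchPage(url);
    archived.push({ url, bytes: Buffer.byteLength(html) });
  }
  return archived;
}

archiveAll(['https://example.org/one/;article;', '', 'https://example.org/two/;article;'])
  .then(results => {
    for (const { url } of results) {
      console.log('Archived', url);
    }
  });
```

Awaiting inside the loop fetches pages sequentially, which is simple and gentle on the origin server; collecting the promises and using Promise.all would be the concurrent alternative.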