README

Pagean

Pagean is a web page analysis tool designed to automate tests requiring web pages to be loaded in a browser window (e.g. 404 error loading an external resource, page renders with horizontal scrollbars). The specific tests are outlined below, but are all general tests that do not include any page-specific logic.

Installation

Install Pagean globally (as shown below), or locally, via npm.

npm install -g pagean

Usage

Pagean runs as a command line tool and is executed as follows:

Installed globally:
> pagean [options]

Installed locally:
> npx pagean [options]

Options:
  -V, --version        output the version number
  -c, --config <file>  the path to the pagean configuration file (default: "./.pageanrc.json")
  -h, --help           display help for command

Pagean requires a configuration file, which can be specified via the CLI as detailed above. If none is specified, the default file .pageanrc.json in the project root is used. This file provides the URLs to be tested and options to configure the tests and reports. Details on the available tests and the configuration file format are provided below.

Test Cases

The tests use Puppeteer to launch a headless Chrome browser. Each URL defined in the configuration file is loaded once, and the applicable tests are executed after page load. Each test result is either passed or failed, but a test can be configured to report a warning instead of a failure. Only a failed test will cause the test process to fail and exit with an error code (a warning will not).
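
The pass/warning/fail rule described above can be sketched as follows. This is an illustration only, not Pagean's actual code: the function names resultFor and exitCodeFor are hypothetical, and the exit code value of 1 is an assumption (the README only states that a failure exits with an error code).

```javascript
// Illustrative sketch only. resultFor and exitCodeFor are hypothetical names,
// not part of Pagean's API.

// Map a raw test outcome to the reported result. With failWarn set,
// a test that did not pass is reported as a warning instead of a failure.
function resultFor(passed, failWarn) {
    if (passed) {
        return 'passed';
    }
    return failWarn ? 'warning' : 'failed';
}

// Only a failed result produces a non-zero exit code (1 is an assumed value).
function exitCodeFor(results) {
    return results.includes('failed') ? 1 : 0;
}

const results = [
    resultFor(true, false), // test passed
    resultFor(false, true)  // test did not pass, but failWarn is set
];
console.log(results, exitCodeFor(results));
```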

Horizontal Scrollbar Test

The horizontal scrollbar test fails if the rendered page has a horizontal scrollbar. If a specific browser viewport size is desired for this test, that can be configured in the puppeteerLaunchOptions.
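
As a sketch of how a viewport size might be set, the snippet below builds a configuration object and prints it as JSON. defaultViewport is a Puppeteer launch option. The 1280x720 size is an arbitrary example, and the sketch assumes Pagean passes puppeteerLaunchOptions to Puppeteer unchanged.

```javascript
// Sketch of a configuration with a fixed viewport for the scrollbar test.
// The viewport size is an arbitrary example value.
const config = {
    puppeteerLaunchOptions: {
        defaultViewport: { width: 1280, height: 720 }
    },
    urls: ['https://gitlab.com/gitlab-ci-utils/pagean/']
};

// Print as JSON, suitable for saving as .pageanrc.json.
console.log(JSON.stringify(config, null, 4));
```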

Console Output Test

The console output test fails if any output is written to the browser console. An array is included in the report with all entries, as shown below:

[
    {
        "_args": [],
        "_location": {
            "lineNumber": undefined,
            "url": "https://this.url.does.not.exist/file.js"
        },
        "_text": "Failed to load resource: net::ERR_NAME_NOT_RESOLVED",
        "_type": "error"
    }
]

Console Error Test

The console error test fails if any error is written to the browser console, but is otherwise the same as the console output test. This separation allows testing for console errors while still allowing any other console output.
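
The difference between the two console tests can be sketched as below. This is not Pagean's implementation: the function names are hypothetical, the entries use the _type and _text fields from the report example above, and the second entry is made up for illustration.

```javascript
// Sample entries in the shape of the report data shown above.
// The "log" entry is a made-up example.
const consoleEntries = [
    { _type: 'error', _text: 'Failed to load resource: net::ERR_NAME_NOT_RESOLVED' },
    { _type: 'log', _text: 'page initialized' }
];

// Console output test: fails if anything at all was written to the console.
function consoleOutputFails(entries) {
    return entries.length > 0;
}

// Console error test: fails only if an entry of type "error" was written.
function consoleErrorFails(entries) {
    return entries.some((entry) => entry._type === 'error');
}

console.log(consoleOutputFails(consoleEntries), consoleErrorFails(consoleEntries));
```

With only the "log" entry present, the console output test would still fail while the console error test would pass.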

Rendered HTML Test

The rendered HTML test is intended for cases where content is dynamically created prior to page load (i.e. the load event firing). The rendered HTML is returned and checked with HTML Hint and the test fails if any issues are found. An array is included in the report with all HTML Hint issues, as shown below:

[
    {
        "col": 9,
        "evidence": "    <div id=\"div1\"></div>",
        "line": 6,
        "message": "The id value [ div1 ] must be unique.",
        "raw": " id=\"div1\"",
        "rule": {
            "description": "The value of id attributes must be unique.",
            "id": "id-unique",
            "link": "https://github.com/thedaviddias/HTMLHint/wiki/id-unique"
        },
        "type": "error"
    }
]

An htmlhintrc file can be specified in the configuration file, otherwise the default "./.htmlhintrc" file will be used (if it exists). See the Configuration section below.

Note: This test may not find some errors in the original HTML that are removed/resolved as the page is parsed (e.g. closing tags with no opening tags).

Page Load Time Test

The page load time test fails if the page load time (from start through the load event) exceeds the defined threshold in the configuration file (or the default of 2 seconds). The actual load time is included in the report. Tests will time out at twice the page load time threshold.
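
The threshold and timeout relationship can be sketched as follows. The function names are hypothetical and the sketch assumes all times are in seconds, matching the pageLoadTimeThreshold default of 2.

```javascript
// Default threshold in seconds, as stated above.
const DEFAULT_THRESHOLD_SECONDS = 2;

// The test passes unless the load time exceeds the threshold.
function pageLoadTimePasses(loadTimeSeconds, thresholdSeconds = DEFAULT_THRESHOLD_SECONDS) {
    return loadTimeSeconds <= thresholdSeconds;
}

// Tests time out at twice the page load time threshold.
function timeoutSeconds(thresholdSeconds = DEFAULT_THRESHOLD_SECONDS) {
    return thresholdSeconds * 2;
}

console.log(pageLoadTimePasses(1.4), pageLoadTimePasses(2.5), timeoutSeconds());
```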

External Script Test

The external script test is intended to identify any externally loaded JavaScript files (e.g. loaded from a CDN) and aggregate those files so they can undergo further analysis (e.g. dependency vulnerability scanning). The test is included here since these tests load fully rendered pages, which allows this data to be aggregated for pages generated using any language or framework. By default the test returns a warning if the page includes any JavaScript files loaded from a different domain than the page (this can be changed to a failure by setting failWarn: false, see the Configuration section below). These files are then downloaded and saved in the "pagean-external-scripts" directory in the project root. Subdirectories are created for each domain, then following the URL path. For example, the following script...

<script src="https://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"></script>

...will be saved as ./pagean-external-scripts/bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js. The data array in the test report includes the original file URL and the local saved filename or applicable error, as shown below.

[
    {
        "url": "https://code.jquery.com/jquery-3.4.1.slim.min.js",
        "localFile": "pagean-external-scripts/code.jquery.com/jquery-3.4.1.slim.min.js"
    },
    {
        "url": "http://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js",
        "error": "Request failed with status code 404"
    }
]

Each external script is saved only once, but will be reported on any page where it is referenced.
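
The mapping from script URL to local file can be sketched as below. This is an illustration based on the report example above, not Pagean's code: the function names are hypothetical, comparing hostnames is an assumed interpretation of "different domain", and the sketch ignores any query string.

```javascript
// Directory name taken from the report example above.
const SCRIPT_DIRECTORY = 'pagean-external-scripts';

// Build the local path: one subdirectory per domain, then the URL path.
function localFileFor(scriptUrl) {
    const { hostname, pathname } = new URL(scriptUrl);
    return `${SCRIPT_DIRECTORY}/${hostname}${pathname}`;
}

// Assumption: a script is external when its hostname differs from the page's.
function isExternal(scriptUrl, pageUrl) {
    return new URL(scriptUrl).hostname !== new URL(pageUrl).hostname;
}

console.log(localFileFor('https://code.jquery.com/jquery-3.4.1.slim.min.js'));
```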

Broken Link Test

The broken link test checks for broken links on the page. It checks any <a> tag on the page with href pointing to another location on the current page or another page (i.e. only http(s) or file protocols).

- For links within the page, this test checks for existence of the element on the page, passing if the element exists and failing otherwise (and passing for cases that are always valid, e.g. # or #top for the current page). It does not check the visibility of the element. Failing tests return a response of "#element Not Found" (where #element identifies the specific element).
- For links to other pages, the test tries to confirm whether the target link is valid as efficiently as possible. It first makes a HEAD request for that URL and checks the response. If an error response is returned (>= 400 with no execution error) other than 429 (Too Many Requests), the request is retried with a GET request. The test passes for HTTP responses < 400 and fails otherwise (if the HTTP response is >= 400 or another error occurs).
  - This can result in false failure indications, specifically for file:// links (404 or ECONNREFUSED) or where the browser passes a domain identity with the request (page loads when tested, but 401 response for links to that page). For these cases, or other false failures, the test configuration allows a boolean checkWithBrowser option that will instead check links by loading the target in the browser (via Puppeteer). Note this can increase test execution time, in some cases substantially, due to the time to open a new browser tab and load the page and all assets.
  - If the link to another page includes a hash, it is removed prior to checking. The test in this case is confirming a valid link, not that the element exists, which is only done for the current page.
  - The test configuration allows an ignoredLinks array listing link URLs to ignore for this test. Note this only applies to links to other pages, not links within the page, which are always checked.

To optimize performance, link test results are cached and those links are not re-tested for the entire test run (across all tested URLs). The test configuration allows a boolean ignoreDuplicates option that can be set to false to bypass this behavior and re-test all links. The results for any failed links are included in the reports in any case.
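
The decision rules for links to other pages can be sketched with a few helpers. These are hypothetical functions written for illustration, not Pagean's implementation, and they cover only the status handling and hash removal (not the requests themselves).

```javascript
// Hypothetical helpers for illustration, not Pagean's actual code.
const TOO_MANY_REQUESTS = 429;

// Remove the hash from a link to another page before checking it.
function stripHash(href) {
    const url = new URL(href);
    url.hash = '';
    return url.href;
}

// A HEAD response >= 400, other than 429, is retried with a GET request.
function shouldRetryWithGet(headStatus) {
    return headStatus >= 400 && headStatus !== TOO_MANY_REQUESTS;
}

// A link passes only for a numeric HTTP status below 400. Anything else,
// including an error code string such as "ENOTFOUND", fails.
function isValidStatus(status) {
    return typeof status === 'number' && status < 400;
}

console.log(stripHash('https://about.gitlab.com/not-found#section'));
console.log(shouldRetryWithGet(404), shouldRetryWithGet(429), isValidStatus(200));
```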

For any failing test, the data array in the test report includes the original URL and the response code or error as shown below.

[
    {
        "href": "https://about.gitlab.com/not-found",
        "status": 404
    },
    {
        "href": "http://localhost:8080/brokenLinks.html#notlinked",
        "status": "#notlinked Not Found"
    },
    {
        "href": "https://this.url.does.not.exist/",
        "status": "ENOTFOUND"
    }
]

Reports

Depending on the reporters configuration, Pagean results can be displayed in the console and saved in up to two reports in the project root directory (any or all of the three outputs can be enabled):

A JSON report named pagean-results.json
An HTML report named pagean-results.html

Both reports contain:

The time of test execution
A summary of the total tests and results (passed, warning, failed)
The detailed test results, including the URL tested, list of tests performed on that URL with results, and, if applicable, any relevant data associated with the test failure (e.g. the console errors if the console error test fails).

Complete reports for the example case in this project (the tests as specified in the project .pageanrc.json file) can be found at the links above.

Configuration

Pagean looks for a configuration file as specified via the CLI, or defaults to a file named .pageanrc.json in the project root. If the configuration file is not found, is not valid JSON, or does not contain any URLs to check, the job will fail.

Below is an example .pageanrc.json file, which is broken into six major properties:

htmlhintrc: An optional path to an htmlhintrc file to be used in the rendered HTML test
project: An optional name of the project, which is included in HTML and JSON reports.
puppeteerLaunchOptions: An optional set of options to pass to Puppeteer on launch. There are no default options. The complete list of available options can be found at https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions.
reporters: An optional array of reporters indicating the test reports that should be provided. There are three possible options - cli, html, and json. The cli option reports all test details to the console, but the final results summary is always output (even with cli disabled). If reporters is specified, at least one reporter must be included. The default value, as specified below, is all three reporters enabled.
settings: These settings enable/disable or configure tests, and are applied to all tests overriding the default values.
- The shorthand notation allows easy enabling/disabling of tests. In this format the test name is given with a boolean value to enable or disable the test. In this case any other test-specific settings use the default values.
- The longhand version includes an object for each test. Every test includes two possible properties (some tests include additional settings):
  - enabled: A boolean value to enable/disable the test (default true for all tests).
  - failWarn: A boolean value causing a failed test to report a warning instead of failure. A warning result will not cause the test process to fail (exit with an error code). The default value for all tests is false except the externalScriptTest, as shown below.

The shorthand:

"settings": {
    "consoleErrorTest": true
}

is equivalent to the longhand:

"settings": {
    "consoleErrorTest": {
        "enabled": true,
        "failWarn": false
    }
}

All available settings with the default values are shown below.

urls: An array of URLs to be tested, which must contain at least one value. Each array entry can either be a URL string, or an object that contains a url string and an optional settings object. This object can contain any of the settings values identified above and will override that setting for testing that URL. The url string can be either an actual URL or a local file, as shown in the example below.

{
    "puppeteerLaunchOptions": {
        "args": [ "--no-sandbox" ]
    },
    "reporters": [
        "cli",
        "html",
        "json"
    ],
    "settings": {
        "horizontalScrollbarTest": {
            "enabled": true,
            "failWarn": false
        },
        "consoleOutputTest":  {
            "enabled": true,
            "failWarn": false
        },
        "consoleErrorTest":  {
            "enabled": true,
            "failWarn": false
        },
        "renderedHtmlTest":  {
            "enabled": true,
            "failWarn": false
        },
        "pageLoadTimeTest":  {
            "enabled": true,
            "failWarn": false,
            "pageLoadTimeThreshold": 2
        },
        "externalScriptTest":  {
            "enabled": true,
            "failWarn": true
        },
        "brokenLinkTest": {
            "enabled": true,
            "failWarn": false,
            "checkWithBrowser": false,
            "ignoreDuplicates": true,
            "ignoredLinks": []
        }
    },
    "urls": [
        "https://gitlab.com/gitlab-ci-utils/pagean/",
        {
            "url": "./tests/test-cases/consoleLog.html",
            "settings": {
                "consoleOutputTest": false
            }
        }
    ]
}
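
The shorthand/longhand equivalence and the per-URL overrides can be sketched as below. This is a hedged illustration, not Pagean's implementation: the function names are hypothetical, only one test's defaults are included, and it assumes a per-URL setting replaces the global setting for that test as a whole.

```javascript
// Defaults for one test, taken from the example configuration above.
const defaults = {
    consoleOutputTest: { enabled: true, failWarn: false }
};

// Expand shorthand (boolean) into longhand (object), filling in defaults.
function expandSetting(value, testDefaults) {
    if (typeof value === 'boolean') {
        return { ...testDefaults, enabled: value };
    }
    // Longhand object, or undefined when the test is not configured.
    return { ...testDefaults, ...value };
}

// Resolve a urls entry, which is either a URL string or an object with
// a url string and an optional settings object.
function resolveUrlEntry(entry, globalSettings) {
    const url = typeof entry === 'string' ? entry : entry.url;
    const overrides = typeof entry === 'string' ? {} : entry.settings || {};
    // Assumption: a per-URL value replaces the global value for that test.
    const merged = { ...globalSettings, ...overrides };
    const settings = {};
    for (const [name, testDefaults] of Object.entries(defaults)) {
        settings[name] = expandSetting(merged[name], testDefaults);
    }
    return { url, settings };
}

const entry = {
    url: './tests/test-cases/consoleLog.html',
    settings: { consoleOutputTest: false }
};
console.log(JSON.stringify(resolveUrlEntry(entry, { consoleOutputTest: true })));
```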

Docker Images

Provided with the Pagean project are Docker images configured to run the tests. All available image tags can be found in the gitlab-ci-utils/pagean repository at https://gitlab.com/gitlab-ci-utils/pagean/container_registry. Details on each release can be found on the Releases page.

Note: Any images in the gitlab-ci-utils/pagean/tmp repository are temporary images used during the build process and may be deleted at any point.

GitLab CI Configuration

The following is an example job from a .gitlab-ci.yml file to use this image to run Pagean against another project in GitLab CI:

pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  script:
    - pagean
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/

Testing With Static HTTP Server

The Docker image shown above includes http-server and wait-on installed globally to run a local HTTP server for testing static content. The example job below illustrates how to use this for Pagean tests. The script starts the server in this project's test-cases directory and uses wait-on to hold the script until the server is running and returns a valid response. The referenced pageanrc file is the same as the project default pageanrc, but references all test URLs from the local server.

pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  before_script:
    # Start static server in test cases directory, discarding any console output,
    # and wait until the server is running
    - http-server ./tests/test-cases > /dev/null 2>&1 & wait-on http://localhost:8080
  script:
    - pagean -c static-server.pageanrc.json
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/

Linting Pageanrc Files

A command line tool is also available to lint pageanrc files, which is executed as follows:

Installed globally:
> pageanrc-lint [options] [file] (default: "./.pageanrc.json")

Installed locally:
> npx pageanrc-lint [options] [file] (default: "./.pageanrc.json")

Lint a pageanrc file

Options:
  -V, --version  output the version number
  -j, --json     output JSON with full details
  -h, --help     display help for command

The --json option outputs the JSON results to stdout in all cases for consistency ([] if no errors found, so that it always outputs valid JSON). Otherwise errors are output to stderr, for example:

.\tests\test-configs\cli-tests\some-test.pageanrc.json
  <pageanrc>.puppeteerLaunchOptions                  should NOT have fewer than 1 items
  <pageanrc>.reporters[0]                            should be equal to one of the allowed values (cli, html, json)
  <pageanrc>.settings.consoleOutputTest              should be either boolean or object with the appropriate properties
  <pageanrc>.settings.pageLoadTimeTest.foo           should NOT contain additional properties: "foo"
  <pageanrc>.settings.pageLoadTimeTest               should be either boolean or object with the appropriate properties
  <pageanrc>.urls[2].settings.consoleOutputTest      should be either boolean or object with the appropriate properties
  <pageanrc>.urls[3]                                 should be either URL string or object with the appropriate properties
  <pageanrc>.urls[5]                                 should have required property url

In some cases, a single error might result in multiple messages based on the options in the schema definition, especially for cases that can be either a single value or an object with specific properties (e.g. the errors for <pageanrc>.settings.pageLoadTimeTest in the example above).

Note that because of the large number of options, which are dependent on an external project, the linting of puppeteerLaunchOptions only checks that at least one property is provided. It does not check the detailed settings.
