README
xml-to-es
xml-to-es was originally written to translate [David Lewis's Reuters collection in SGML](http://www.daviddlewis.com/resources/testcollections/reuters21578/) into cleaned-up JSON for ElasticSearch.
It has since been extended to translate XML into JSON, HTML, or raw text. Output can go to one file per XML document, one file per N XML documents, or a single file holding the output of all the input XML documents.
In version 0.2.0, xml-to-es can accept a `generator` argument that supports output to any kind of sink, including a stream. There is an example of this in `examples/db-config.js`.
Using `examples/convert.js`, documents can be submitted as a comma-delimited list or as a directory name with
xml-to-es translates XML (or SGML) documents into JSON documents suitable for ElasticSearch -- or into plain text suitable for
OpenNLP program input. Additionally, there is an HTML output option intended for use with Chiliad Discovery and an
open-ended option that allows such things as pushing the documents to a database.
There is also a module, `examples/indexFiles.js`, and a `mapping.json` file that can be used as examples of how to submit the JSON-ized files to ElasticSearch.
xml-to-es uses libxml-to-js (and the underlying libxmljs) and then massages the resulting JSON object to make it meaningful for search engines. Elasticsearch can handle nested JSON, but the nested JSON produced automatically from XML/SGML is almost never what you want for indexing or any other purpose. In addition, most XML is by nature very noisy. xml-to-es gives you some fine-grained control over what is produced in JSON.
A simple config file lets you
- preProcess the JSON to get it in shape for subsequent processing as described in the remaining bullet items
- un-nest the XML by promoting important elements to top level
- turn attributes (represented as '@' properties) into elements using promotion
- flatten arrays, which libxml* renders as `[{ '#' : value1}, {'#' : value2},...]`
- delete elements you don't need
- rename elements
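The array flattening mentioned in the list above can be pictured with a small stand-alone sketch. The helper name `flattenHashArray` and the sample values are invented for illustration; this is not the xml-to-es implementation, only one way the `[{ '#' : value1}, ...]` shape could be collapsed.

```javascript
// Illustrative helper only (hypothetical name); not part of the xml-to-es API.
// Collapses [{ '#': v1 }, { '#': v2 }] into [v1, v2].
function flattenHashArray(value) {
  // Anything that is not an array is returned untouched.
  if (!Array.isArray(value)) return value;
  return value.map(function (item) {
    // Only unwrap objects whose single key is the '#' place-holder.
    const isWrapper = item !== null && typeof item === 'object' &&
      Object.keys(item).length === 1 && '#' in item;
    return isWrapper ? item['#'] : item;
  });
}

const topics = [{ '#': 'cocoa' }, { '#': 'coffee' }];
console.log(flattenHashArray(topics)); // [ 'cocoa', 'coffee' ]
```

Items that carry anything besides the `'#'` key are left alone, so no information is silently dropped.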
XML/SGML structural errors
XML/SGML structure in the wild is not always perfect, so xml-to-es handles some anomalies:
* missing closing tag
* missing opening tag
xml-to-es handles these by examining the XML/SGML input before it is JSON-ized.
It will not currently handle a sequence where the first document has a missing closing tag and the second document
has a missing opening tag. That would be possible but is left for future work as needed. (The input file
`test/data/twoDocsNoSepTagsTest.xml` is available for experimentation on that problem. Note: the extension is `xml` instead of `sgm` to protect tests.)
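To make the idea of examining the raw input before it is JSON-ized more concrete, here is a minimal sketch that scans text for unbalanced document tags. It is an assumption about how such a check could look, not the code xml-to-es uses; the function name `findTagAnomalies` and the `DOC` tag are hypothetical.

```javascript
// Hypothetical helper, not taken from xml-to-es: scans raw XML/SGML text for
// document-level tags that are opened twice in a row or closed without being open.
function findTagAnomalies(text, tag) {
  // Matches <TAG ...> and </TAG>, case-insensitively.
  const re = new RegExp('<(/?)' + tag + '\\b[^>]*>', 'gi');
  const anomalies = [];
  let open = false; // true while we are inside a document
  let m;
  while ((m = re.exec(text)) !== null) {
    const isClose = m[1] === '/';
    if (!isClose) {
      // A new opening tag while the previous document is still open.
      if (open) anomalies.push({ kind: 'missing closing tag', index: m.index });
      open = true;
    } else {
      // A closing tag with no document currently open.
      if (!open) anomalies.push({ kind: 'missing opening tag', index: m.index });
      open = false;
    }
  }
  // Input ended while a document was still open.
  if (open) anomalies.push({ kind: 'missing closing tag', index: text.length });
  return anomalies;
}

console.log(findTagAnomalies('<DOC>a<DOC>b</DOC>', 'DOC'));
```

A scan like this also shows why the unsupported case is hard: when one document lacks its closing tag and the next lacks its opening tag, the remaining tags still pair up, so the two documents look like one.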
Installation
With npm, just do:

    npm install xml-to-es

Then, `cd` to the xml-to-es directory and run:

    npm install

For GitHub:

    git clone http://github.com/imbroglioj/xml-to-es.git

Documentation
The `examples` and `test` directories show a number of ways to use and control these modules.
Examples directory
- `convert.js`: uses `lib/xml-to-es.js` to convert XML/SGML to massaged JSON
- `indexFiles.js`: using an index config file, submits a set of JSON files to ElasticSearch for indexing (requires that ElasticSearch be running!)
- `*-config.js`: various input and output configuration files to be used with `convert.js` and `indexFiles.js`
Notes on running examples:
- If you want to copy `convert.js` or `indexFiles.js` to top level to experiment with modifying them, you will have to
  - change `require(path.resolve(__dirname,'index.js'))` to `require('xml-to-es')`
  - install the `optimist` module (using `npm`).
- change
JSON tweaks
Some JSON tweaks are provided using the config file `input` property:
input-config
- `preProcess`: modify the JSON object (config.json) in any way
- `promote`: move a nested element/object-property to be a top level object property
- `delete`: remove unneeded properties from the resulting JSON
- `rename`: rename property keys in the JSON
- `flatten`: somewhat like `promote`, but typically used to remove noise from what should be an array. Some SGML kludges can create complicated object nesting in the JSON which can completely obscure the fact that we have an array as a property value. (To see a before example, temporarily remove the `flatten` property from a copy of `lewis-input-config.js`.) Once you identify the offending XML tag ('d' for the Reuters collection), xml-to-es will remove the extra tag and flatten the array value by removing place-holder property names like '#'.
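The effect of `promote`, `rename` and `delete` on a parsed document can be sketched with plain object manipulation. The three functions below are hypothetical stand-ins written for this illustration; they do not reflect the actual config syntax or internals of xml-to-es, and the sample document is invented.

```javascript
// Hypothetical stand-ins for the tweaks described above; NOT the xml-to-es API.

// Move the property found at `path` (an array of keys) to the top level of `doc`.
function promote(doc, path) {
  let parent = doc;
  for (let i = 0; i < path.length - 1; i++) {
    parent = parent[path[i]];
    if (parent === null || typeof parent !== 'object') return doc; // path not present
  }
  const key = path[path.length - 1];
  if (key in parent) {
    doc[key] = parent[key];
    if (parent !== doc) delete parent[key];
  }
  return doc;
}

// Rename a top-level property key.
function rename(doc, from, to) {
  if (from in doc) {
    doc[to] = doc[from];
    delete doc[from];
  }
  return doc;
}

// Remove unneeded top-level properties.
function remove(doc, keys) {
  keys.forEach(function (k) { delete doc[k]; });
  return doc;
}

// Invented sample: attributes appear under '@', content is nested under TEXT.
const doc = {
  '@': { NEWID: '1' },
  TEXT: { TITLE: 'Sample title', BODY: 'Sample body' },
  UNKNOWN: 'noise'
};

promote(doc, ['TEXT', 'TITLE']); // un-nest an important element
promote(doc, ['@', 'NEWID']);    // turn an attribute into an element
rename(doc, 'NEWID', 'id');
remove(doc, ['UNKNOWN', '@']);

console.log(JSON.stringify(doc));
```

After these calls the document keeps `TEXT.BODY` nested, while `TITLE` and `id` sit at the top level where a search engine mapping can address them directly.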
output-config
The output-config must `require` the input-config you want.
The output config files (for example `json-config.js`, `db-config.js`, `text-only-config.js`) illustrate the output options:
- fmt: JSON|HTML or whatever formats you might add to Generation.js
- noFile: true if there is a user-supplied generator and it creates its own output sink
- fileExt: extension of the output file (the output file name is created from the input file name, the JSON id property and the output.fileExt). IF REGEX, terminate with "