After one and half years of project existence we've decided to push it forward and rewrite the codebase. Archivist is facing transformation to a library for building publishing toolsets rather than predefined system. Still we like to think of Archivist as kind of missing link between digital documents and data. It is a key feature of project. We just want to make it more flexible.
Key features of upcoming release:
- New Substance core
- Substance Components instead of React.js and backbone (faster render time, less dependencies)
- Realtime collaboration instead of old locking system
- PostgreSQL instead of MongoDB and ElasticSearch (Yes, you will be able to run whole system on a small virtual server)
Archivist is open source project released under the GPL license. If you want to help project you can fund its development and maintenance.
Archivist is a full-stack publishing solution involving different technologies to power digital archives. Archivist originally has been developed for the [Memorial International Society](http://memorial.de/index.php?id=18) to publish interviews of Ost-Arbeiters and WWII prisoners. They are published as full transcriptions (in russian) complete with multimedia sources (audio, video). Editors are able to tag and link subjects, locations, persons and definitions in the text, so that the archive can be queried later in interesting ways. Researchers are able to perform a full-text search, but also filter interviews by related subjects and external entities.
Read more about Archivst in one of project mantainers post on Medium.
Opened sourced under GNU General Public License v3.0, created by Data for Society e.V., Substance Software GmbH, Memorial International Historical, Educational, Charitable and Human Rights Society.
Archivist is set of tools which all are communicating with Node.js server API, let's describe data models and each components afterwards.
There are several document models involved: interviews, subjects, entities and users. Interviews contain the transcripted text plus annotations and metadata, subjects are organized as a tree-structured ontology, and entities are just a list or set of units having additional information. Along with creating interview transcriptions, also the other ontologies are refined and extended. E.g., sometimes the editors merge two subjects or entities into one. Such changes need to have an immediate effect on the interviews, too, e.g., subject/entity references inside document need to be updated to reflect this change.
A custom but pretty conventional type of document, where paragraphs are statements of one person, the interviewer or the interviewee (still you can use it as usual paragraphs). The interview is annotated with basic markup, references to subjects, entities, and editor comments.
- name (used to display subject names in public part of archive)
- workname (used to display subject names in closed part of archive, e.g. Archivist Writer)
- description (subject description, currently it's not display anywhere outside of tree manager)
- parent (reference to the parent node inside tree)
- position (position inside tree leaf)
There are different types of entities:
- locations (toponyms and prisons, they are different, but both contains geo-locational information)
All entities are pretty much independent from each other.
User schema is pretty much oriented to store Google profile. However there is one important thing. Each user have access and super properties. First one it to grant access for the editing, second one is for granting super access. Super user can manage users and perform dangerous operations, e.g. merging and removing subjects and entities.
At the heart of the platform there’s Archivist Writer, a modern web-editor which allows you to annotate your text with basic markup and external data, e.g. you can:
- reference entities inside text
- mark any piece of text as related to some set of terms from an ontology-tree
- insert timecodes to synchronize text with media source
- leave comments for any piece of text for editors/researches collaboration
- describe the document’s metadata
Some of the interviews from Memorial’s Ost-Arbeiters archive last over 7 hours of time. The resulting documents are typically incredibly large, e.g., more than 10000 paragraphs, having the same amount of annotations and comments attached to it.
Archivist Writer protects every request using Archivists Auth API.
In the current Archivist platform we use a classical locking approach to avoid concurrent editing hazards. E.g. every document has version number, if your document version isn't equal to server one you'll not be able to save it.
You can connect any external data storage for referencing, basicly you just need to expose it as JSON REST API. Archivist as platform use special set of managers for managing data. Archivist Writer is built on top of Substance library. It's available as part of Archivist Core.
You can learn how to connect Archivist Writer to your application by discover source code of Archivist.
Archivist Reader was developed to present interviews in the best possible way. It makes possible exploring linked resources without losing the place in the interview. You can jump straight to the media source in that places where editors placed timecodes, so you can read and listen or watch the original record. Again without losing position in the text. All resources (locations, persons, etc.) have links to the [resource browser], so you can see a full description of locations and persons etc. as well as access all interviews where a specific entity was mentioned. You can also see every location on a map using our map browser.
Topics panel contains index like computed tree of used in concrete interviews subjects. You can click on each subject or entity and see where it mentioned inside of the document, e.g. it will hightlight exact fragments of text, scrollbar hightlighting will help you to understand where other fragemnts located.
Archivist Reader designed to play videos from Vimeo online video platform. It uses Vimeo JS API library Froogaloop for easiest video control. For playing audio files Reader use jPlayer library. You will need to upload audio files to your media server in two formats (mp3, ogg) to cover most of the modern browsers compability.
You can learn how to connect Archivist Reader to your application by discover source code of Archivist.
Archivist Browser is the main entry point for your archive. Here you can see a list of all documents from your archive, perform full-text search queries and filter using the ontology tree. Archivist Browser connected to our indexer API which is kind of interface for running ElasticSearch queries. Read more about indexing part of Archivist below. From the result list you are able to jump right into an interview highlighting the subject or entity you were interested in.
If your query matched name of entity (or synonym) which are mentioned in published documents it will tell you that it founds articles about that entity. Article names are links to [resource browser] pages.
This is additional peace to Archivist Browser, it will print you all details related to concreate entity as well as list of documents where this entity mentioned (with counters). From this list you can jump right into a document highlighting the entity you were interested in.
It's very simple application which is built using Substance light-weight component implementation inspired by React and Ember.
Checkout source code.
Map Browser is a map interface with two clustered layers which represents all records of toponyms and prisons entities. While hovering each point you’ll see the name of location and how often it’s mentioned in documents of your archive, you can click on the exact point and get the full description as well as all referenced documents complete with links to entry point of that entities inside a document.
Map Browser built on top of mapbox.js library, e.g. you can use your own tiles with the help of Mapbox platform. Map Browser use power of Leaflet.js and it's marker clustering plugin to let you browse your document collection from geographical perspective.
Map Browser also built using Substance components.
Checkout source code.
Protected part of Archivist contains set of data managers. All of them built on top of Backbone.js library. Let's look at them shortly.
Here you can see a list of documents. You can open exisitng, remove old one or create new document. You'll see when the document was edited last time as well as who edit exactly did it. Workflow and title filters will help you to find desired document. Document's workflow include four statuses: transcripted, verified, finshed and published. They are defined inside document schema and can be edited inside Archivist Writer along with other metadata.
Subjects manager represents your ontology as a big hierarchical tree with names and counters (you will see number of fragments for each subject summed up by the tree and number of referenced documents). You can manage subjects right here. E.g. you can
- add new subject
- find subject by name
- edit subjects name/workname/description
- move subjects across tree leafs
- request list of referenced documents
- remove subjects
- merge subjects Last two operations will activate maintanance mode.
Subjects manager use jstree library for tree editing and represenation of data.
Locations managers (toponyms and prisons) consists of list (similar to documents dashboard but with refernce counting and ability to request a list of referenced documents) and map views. You can add new locations in modal form. Forms are different for toponyms and prisons. However each of these forms contains maps with reversal geocoding for chosing location. You can also add number of synonyms for each location, later this values will be used as synonyms of location name inside archivist browser. All locations without geo coordinates will be placed in list inside map views, so you can easily fix them.
Persons manager is very similar to locations lists. You can also filter persons by global property. We used this property to filter out persons which a local to the document, e.g. we don't have enough information about person and it'll be excluded from search index.
Definitions also inherited list view style with modal forms. For Memorial International Memorial's archive we used four different types of definitions: general comments, prison camp reality, abbreveations, language comments. Last two types are excluded from search index.
Entity merging manager
This is manager which allows you to merge two entities. Even different types of entities could be merged. This manager available only for super users cause it will activate maintanance mode.
This manager used for granting or restriction access to the protected part of Archivist. You can gave access to editing document/entities/subkects or assign super users.
This manager available only for super users.
Let's look into server code.
Source code contains three folders:
- views (it contains Jade templates)
- models (here you can find mongoose data models definitions as well as a lot of static methods attached to each schema)
- controllers (here you can find a lot of APIs)
Let me describe some of the most important concepts.
REST like APIs
All entities use mongoose rest plugin which attach REST like methods to their schemas.
During merging or removing entities/subjects system will activate a maintanence mode. This means that no one will allowed to save any changes until all affected documents will changed. All users who are editing afected documents will be forced to refresh Archivist Writer.
System will create a backup of all changed documents and will restore it in case of any errors. If your system will crashed you'll likely stuck in maintenance mode. In this case you'll need to change maintanace mode flag manually. It is stored as a document inside mongodb system collection.
Some APIs used in memory caching. You can define how often this cache will be refreshed.
All Elastic Search related things are defined here. For indexing Archivist used special queue API which will gradually process all of your indexing tasks. If you need to run mass operations like index seeding or reindexing all documents do it from localhost, don't forget to uncommen this routes. Don't expose this methods on production server.
Archivist use Google OAuth 2 for authentication (we are using passport.js for that, so you can easily add more methods). For any other operations we are using JWT. Tokens are stored in user's localStorage, they are valid during one week. Token will be automaticaly renewed in lat two days if user will remain active, otherwise user will be forced to login after one week. Before every operation we are validating token and checking user permissions.
Archivist Core is isolated repository which contains:
- Interview Model
- Archivist Writer
- Archivist Reader
- Archivist Browser
All of that tools use legacy version of Substance library. All of them are available under MIT license.
You can translate interfaces of all open accessed tools using polyglot.js. Just fill all translations in language files and require our loader. You can implement laguage switcher that will call
i18n.switchLocale(locale) function, then after refresh you'll get translated interface.
For Archivist Reader you will also need to run static server with your audio files and photos. Archivist will expect to get respondent photos from photos folder and audio records from audio folder.
- Persons index
- Find and replace
- Bug fixes ;)
Archivist used environment file for all sensetive settings. Here is the list of available settings.
|ARCHIVIST_HOST||Url of your archive|
|AUTH_SECRET||Secret key to sign JSON Web Tokens (JWT)|
|ES_HOST||Elastic Search instance url|
|INDEX||Turn on indexing (true or false)|
|GOOGLE_CALLBACK||One of the redirect_uri values obtained from the Google Developers Console|
|GOOGLE_ID||The client ID you obtain from the Google Developers Console|
|GOOGLE_SECRET||The client secret you obtained from the Google Developers Console|
|MEDIA_HOST||Url of your media storage|
|MAPBOX_MAPID||Mapbox map ID|
|MAPBOX_TOKEN||Mapbox access token|
|MONGO_URL||MongoDB connection URI|
|RS_NAME||MongoDB replica set name (optional)|
Put them in .env file inside the root of Archivist directory, like this:
ARCHIVIST_HOST=localhost:5000 AUTH_SECRET=yourSescret ES_HOST=http://localhsot:9200 etc
$ npm install
Now you can run the server
Archivist shipped with a number of gulp scripts for bundling every component. If you installed Gulp globally you can use it for bundling each component individually. Just gulp with one of this suffixes: manager, i18n, writer, reader, browser, resources, maps. Calling just gulp will run all of them.
For running full version of Archivist platform you'll need:
- Node.js server
- MongoDB server
- Elastic Search instance
- HTTP static server
For uploading application you can use git as described in mentioned article. Don't forget to set environment variables, it will fail without them.
For a MongoDB hosting you can find a lot of hosting services, you can also do it by yourself. Keep in mind that using replica set is much safer. Our choise is Compose service.
For Elastic Search you can also find a lot of services. However we prefered to keep it in the DigitalOcean droplet in private network with application. Just give it more RAM.
Archivist platform released under the GPL license.