Revamping it all!
Soon after we announced the preview release of our search platform back in April of this year, we got busy adding more articles and more life sciences dictionaries to our platform, and in the process we put heavy stress on our infrastructure in all directions. As we added more articles, the keyword index grew beyond the number of attributes supported by the underlying database! As we added more dictionaries, the search computation time increased many-fold, resulting in occasional timeouts of search queries. To make matters worse, the tight coupling of our platform with the crawled and parsed data allocated too many resources (both compute and storage) for each of our deployments, be it alpha, beta, or an internal development site. Incorporating any backup or disaster recovery plan would have further stressed and complicated our framework.
It soon became apparent that we needed to make some significant architectural changes. So we did!
A New Document Store
Thanks to our DFS implementation, our corpus data has always been distributed evenly across our infrastructure. However, as mentioned above, our various deployments were tightly coupled with their underlying data, leading to redundant compute and storage instances as well as the absence of any pragmatic backup or disaster recovery plan. Therefore, Varun designed, thoroughly tested, and deployed a new document store layer on top of the existing DFS, which now allows multiple applications to share or segregate the same data corpus through project namespaces. It also allows for encryption at the project level, so that the data can be securely accessed by various deployments using public/private key infrastructure. Similarly, it enables creating an active data backup at the project level that can be archived and restored as needed.
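To make the namespace and backup ideas concrete, here is a rough, in-memory sketch. It is not our actual document store (which sits on top of the DFS and adds per-project encryption); the class and method names (ProjectStore, put, snapshot, restore) are illustrative assumptions only.

```python
import copy

class ProjectStore:
    """Toy stand-in for the document store layer: deployments share one
    corpus but read and write only within their own project namespace,
    and backups are taken per project rather than per deployment."""

    def __init__(self):
        self._data = {}  # namespace -> {doc_id: content}

    def put(self, namespace, doc_id, content):
        self._data.setdefault(namespace, {})[doc_id] = content

    def get(self, namespace, doc_id):
        return self._data[namespace][doc_id]

    def snapshot(self, namespace):
        """Per-project active backup that can be archived elsewhere."""
        return copy.deepcopy(self._data.get(namespace, {}))

    def restore(self, namespace, snapshot):
        """Restore one project without touching any other namespace."""
        self._data[namespace] = copy.deepcopy(snapshot)
```

Because backup and restore operate on a single namespace, a disaster recovery plan no longer needs to duplicate the whole deployment.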
Processing Pipeline (ETL)
Though we had a decoupled architecture from the very start, the new document store enabled us to greatly simplify and streamline our processing pipeline (from crawler, word breaker, keyword index, NLP, and classifiers through dictionaries to search results computation), breaking components down more discretely and letting them run concurrently, on demand, at data ingestion time, thereby reducing search-time computation dramatically. Under this new architecture, each processing component derives from a simple application base class provided by the document store pipeline, which processes newly uploaded/crawled documents/pages through a callback function implemented by the component developer. Here is an example of a simple component in our pipeline that prints the document id and its content when called with a document: Sample Application for Document Store Pipeline.
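For readers who cannot open the linked sample, here is a self-contained sketch of what such a component might look like. The base class is a mock stand-in for the real SDK, and the exact API shape (DocumentStoreApp, on_document) is an assumption; only the domain_selector and module_dependencies attributes are described in this post.

```python
class DocumentStoreApp:
    """Mock stand-in for the document store pipeline's application base
    class (illustrative only; the real SDK class is not reproduced here)."""
    domain_selector = "*"      # which documents this component should see
    module_dependencies = []   # components the pipeline must run first

    def on_document(self, doc_id, content):
        """Callback invoked by the pipeline for each new document."""
        raise NotImplementedError


class PrintDocApp(DocumentStoreApp):
    """Toy component: print each document's id and its content."""
    domain_selector = "life-sciences/*"  # hypothetical namespace scope
    module_dependencies = ["crawler"]    # hypothetical upstream component

    def on_document(self, doc_id, content):
        print(f"{doc_id}: {content}")


# Pipeline-style invocation (the real pipeline would call this per document):
PrintDocApp().on_document("doc-001", "BRCA1 is a human tumor suppressor gene.")
```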
Each component can define the scope of documents it processes (see the "domain_selector" attribute in the code above) and can also declare its dependent components (see the "module_dependencies" attribute in the code above), so that the pipeline ensures dependent components are processed first.
With the new pipeline architecture and document store SDK, writing components (both our own and those from our future collaborators) has become much simpler. Varun was able to discard the majority of our past monolithic crawler, NLP, indexer, and storage components and rewrite them as new pipeline application modules in a short period of time. Here are a couple of brief snippets to peek under the covers, so to speak.
Here is how one uploads new content in the document store: Sample Application to add documents into document store
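As a rough, offline illustration of the upload path (the real SDK client is not shown here, and the names IngestingStore, register, and add_document are assumptions), note how adding a document immediately notifies registered pipeline components, which is what moves the heavy computation to ingestion time:

```python
class IngestingStore:
    """Toy stand-in for the document store upload API: storing a document
    triggers every registered pipeline component's callback right away."""

    def __init__(self):
        self.docs = {}        # doc_id -> content
        self.components = []  # callables invoked per new document

    def register(self, component):
        self.components.append(component)

    def add_document(self, doc_id, content):
        self.docs[doc_id] = content
        for component in self.components:  # process at ingestion time
            component(doc_id, content)


store = IngestingStore()
store.register(lambda doc_id, content: print("indexing", doc_id))
store.add_document("doc-001", "TP53 regulates the cell cycle.")
```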
Here is how we build the initial text index (see more on this below): Code Snippet to create a keyword search index using document store pipeline
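The linked snippet is not reproduced inline, but the core idea of the initial text index can be sketched in a few lines: a word breaker splits each document into tokens, and each token maps to the set of documents containing it. This is a simplified illustration, not our production indexer.

```python
import re
from collections import defaultdict

def build_keyword_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        # Naive word breaker: runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", content.lower()):
            index[word].add(doc_id)
    return index


docs = {"d1": "BRCA1 is a gene", "d2": "TP53 is a gene"}
index = build_keyword_index(docs)
# index["gene"] → {"d1", "d2"}
```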
New Inverted Index
As stated above, one problem we encountered with growing data was the physical limit on the number of attributes we could store in the key-value database (Cassandra). That forced us to rethink and look more closely into other mature technologies for storing the inverted index (a list of documents and their relevant attributes as values, keyed by a set of words; thus the name inverted index!). Apache Lucene's indexing classes provided a good guiding framework for Varun to implement the same internally (without Java or a large memory footprint, two of the reasons we chose not to use Lucene itself), as demonstrated in the code snippet above. Furthermore, it has enabled our infrastructure to scale linearly over DFS while keeping the initial search index relatively small and super fast.
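To illustrate the data shape described above (words as keys, per-document attributes as values), here is a toy positional inverted index with a simple AND query over postings. This is only a sketch of the concept; the production index is a custom, Lucene-inspired implementation over DFS.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """word -> {doc_id: [positions]}; positions stand in for the
    per-document attributes stored as values."""
    index = defaultdict(dict)
    for doc_id, content in docs.items():
        for pos, word in enumerate(content.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def and_query(index, words):
    """Documents containing every query word (postings intersection)."""
    postings = [set(index.get(w.lower(), {})) for w in words]
    return set.intersection(*postings) if postings else set()
```

Because each word's postings can live in its own file or partition, an index of this shape distributes naturally over a DFS instead of pressing against a single table's attribute limit.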
While our claim to fame is our unsupervised learning algorithm, which clusters results into meaningful concepts (topics), we can also leverage external dictionaries, catalogs, or taxonomies to improve the quality and relevance of the results we surface in any given domain or project. Our new document store SDK can be used by our partners to easily add new dictionaries to the platform with a few simple lines of code. Here is an example of how we integrated Protein Data Bank (PDB) identifiers as a dictionary in our document store (as a result of this integration, if a protein is identified by our search engine, we can return its attributes from PDB). Here's a sample Dictionary to extract PDB Identifiers using the document store pipeline.
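The extraction step of such a dictionary can be sketched as follows. PDB identifiers are four characters long and start with a digit 1-9 (e.g. "1TUP"), so a naive pattern like the one below will also over-match plain four-digit numbers; the real dictionary filters candidates by querying the PDB service, which this offline sketch does not do. The function name is illustrative.

```python
import re

# Four characters, leading digit 1-9, then three alphanumerics.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

def extract_pdb_ids(doc_id, content):
    """Per-document dictionary result: candidate PDB ids found in the text.
    Note: a year like "1994" also matches; validation against the PDB
    service would discard such false positives."""
    return {"doc_id": doc_id, "pdb_ids": sorted(set(PDB_ID.findall(content)))}
```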
Similarly, here is an example of how we integrated Wikipedia references in our document store. Sample Dictionary to extract Wikipedia attributes using document store pipeline.
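In the same spirit, the Wikipedia integration can be sketched like this. The real module calls one of our internal Wikipedia helper services; here the lookup is injected as a plain function so the example stays self-contained and offline, and all names are illustrative assumptions.

```python
def wikipedia_attributes(terms, lookup):
    """Attach Wikipedia-derived attributes to each matched term."""
    return {term: lookup(term) for term in terms}


def offline_lookup(term):
    # Stand-in for a summary fetch; returns only the canonical page URL.
    return {"url": "https://en.wikipedia.org/wiki/" + term.replace(" ", "_")}
```

Swapping offline_lookup for a real web-service client is the only change a partner would need to make this sketch fetch live attributes.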
We will publish many more sample dictionary interfaces (such as Name, Place, or Event classifiers) along with some of our internal helper services (such as the PDB and Wikipedia web services used in the scripts above) to help with extracting attributes from public or private dictionary sources. The new interface not only makes writing dictionaries simple but has also greatly improved search-time performance. Previously, our search compute time increased as we added more dictionaries; in the new framework, each dictionary maintains its own results per document, so the search engine simply queries matching terms from the relevant dictionaries and coalesces all attributes across them. As an example, a given protein may now have its attributes derived from PDB, Wikipedia entries, and any other supplied dictionaries (e.g. STRING) or product catalogs (e.g. Millipore Sigma or Thermo Fisher Scientific).
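The coalescing step described above amounts to a simple merge at query time, since each dictionary has already done its work per document at ingestion. A hedged sketch (the data shapes and names are illustrative, not our actual search-engine internals):

```python
def coalesce(term, dictionaries):
    """Merge one term's attributes across every dictionary that knows it."""
    merged = {}
    for name, results in dictionaries.items():
        if term in results:
            merged[name] = results[term]
    return merged


dictionaries = {
    "pdb":       {"p53": {"pdb_id": "1TUP"}},
    "wikipedia": {"p53": {"url": "https://en.wikipedia.org/wiki/P53"}},
    "string":    {"egfr": {"interactions": 42}},
}
# coalesce("p53", dictionaries) → attributes from "pdb" and "wikipedia" only
```

Adding a new dictionary therefore adds no search-time computation beyond one more lookup in this merge.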
While you experiment with our new framework, which is up and running at https://beta.nlpcore.com, stay tuned for more exciting news from NLPCORE. Soon, we will publish our full set of APIs that allow both solution developers and algorithm developers to take our platform for a spin in the life sciences vertical and beyond, including with their own data sets!
Scripting our Search Experience
Lastly, in the true spirit of all hands on deck, I decided to write a few Python scripts to help our users look for specific search results without using our website. Thanks to the power of Python, in only a few lines of code I was able to write an extensible console app that provides commands to list topics (e.g. proteins, genes, disease names, etc.) discovered automatically from a life sciences project, instance names for a given topic, and their specific references (surrounding text) within the articles. You are welcome to play with and modify the script as you wish (reader's exercise: find an instance name for a given topic where the instance reference contains a specific word, e.g. find a protein sequence that contains a "mutation"). Sample Console Application to list topics and annotation references using search APIs
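The linked script talks to our search APIs over HTTP; the offline sketch below shows the same commands against canned data so you can see the shape of the topic/instance/reference hierarchy (all data and function names here are illustrative, including a stab at the reader's exercise):

```python
# Canned stand-in for API results: topic -> instance -> reference snippets.
PROJECT = {
    "protein": {"p53":   ["p53 mutation disrupts DNA binding"],
                "BRCA1": ["BRCA1 repairs double-strand breaks"]},
    "gene":    {"TP53":  ["TP53 encodes the p53 protein"]},
}

def topics():
    """List discovered topics."""
    return sorted(PROJECT)

def instances(topic):
    """List instance names for a given topic."""
    return sorted(PROJECT.get(topic, {}))

def references(topic, instance):
    """List reference snippets (surrounding text) for one instance."""
    return PROJECT.get(topic, {}).get(instance, [])

def find(topic, word):
    """Reader's exercise: instances whose references contain `word`."""
    return sorted(i for i, refs in PROJECT.get(topic, {}).items()
                  if any(word in r for r in refs))

# find("protein", "mutation") → ["p53"]
```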
Note: all scripts and API references provided here are pre-release, AS-IS and subject to change.