NLPCORE: COVID-19 Data Visualizer! (By Yos Wagenmans, High School Intern)
First, I would like to thank Naveen Garg (Co-founder and CEO) and Varun Mittal (Co-founder and CTO) for their constant support on this project, which would not have been possible without them. We invite you to check back in the future for updates.
Our objective was to give researchers a better way to understand the contents of the COVID-19 Open Research Dataset and the connections that might exist among biological concepts within or across articles. Using Natural Language Processing and Machine Learning algorithms created by NLPCORE, we can find, for a given search term, not only the most relevant documents but also the biological concepts they mention and the interconnections among those concepts. The programs we have created are continually updated and streamlined to perform these tasks more effectively.
In our approach, we explored using a node-based graph but ultimately decided on a concentric ring chart to represent the vast amount of data. The main drawback of node-based visualization is that it scales poorly when hundreds of nodes have to be shown. For my summer 2019 project, I had built such a node graph, in which each node represents a relevant concept (a word or phrase) from an article; see below. For this project, we looked for a diagram in which the query term could be centrally located, the depth or degree of relevance could be shown, and topics and types could be represented. The D3.js sunburst visualization enabled us to do just that, and it can even display the path to the selected topic. An example is shown below:
The chart on the left has Coronavirus as the central term and transmission as the default secondary term. From there, the algorithm sorts the data by degree of relevance and groups nodes of the same type together. Each colored wedge is a single node, but its size is determined by the frequency of nodes of that type. The path diagram on the right provides an easy way to see the related parent categories.
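To make the sorting and grouping step concrete, here is a minimal Python sketch of how concept records could be nested into the name/children/value tree that D3's hierarchy layouts read. It is not the NLPCORE algorithm: the function name `build_sunburst`, the (term, type, degree) record layout, and the sample terms are all assumptions made for illustration.

```python
from collections import defaultdict


def build_sunburst(center, concepts):
    """Nest concepts as center -> degree -> type -> term.

    `concepts` is an iterable of (term, concept_type, degree) tuples.
    This record layout is hypothetical, not the NLPCORE API's output.
    """
    # Bucket terms first by degree of relevance, then by concept type.
    by_degree = defaultdict(lambda: defaultdict(list))
    for term, concept_type, degree in concepts:
        by_degree[degree][concept_type].append(term)

    levels = []
    for degree in sorted(by_degree):  # lowest degree first
        groups = []
        for concept_type, terms in sorted(by_degree[degree].items()):
            # Each leaf is one node; its value is the number of nodes
            # of the same type, so frequent types get wider wedges.
            size = len(terms)
            leaves = [{"name": term, "value": size} for term in terms]
            groups.append({"name": concept_type, "children": leaves})
        levels.append({"name": f"degree {degree}", "children": groups})
    return {"name": center, "children": levels}


# Made-up sample records, for illustration only.
sample = [
    ("ACE2", "protein", 1),
    ("spike", "protein", 1),
    ("cough", "symptom", 1),
    ("bat", "species", 2),
]
tree = build_sunburst("Coronavirus", sample)
```

The resulting dictionary can be serialized with `json.dumps` and passed to a D3 hierarchy on the browser side; how the real program transfers the data is not described here.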
Once a user clicks a wedge, the new query term is sent to the program, which displays a node graph below the chart with the clicked term as the updated central term (elaborated in NLPcore blog post: here). The visual below shows the interactive force-based graph with Coronavirus as the default center. Color coding helps organize the terms. Click functionality on the nodes and edges has been written but is not yet enabled in this version; it would direct the user to a ranked list of scientific articles with the search terms highlighted in corresponding colors. Additionally, the physics parameters can be adjusted using the sliders on the right.
To process the vast amounts of data, the program makes API calls to the NLPCORE server and quickly receives the outputs needed to draw a visualization. More information on the APIs can be found in the API catalog.
Code can be found in and run from the NLPCORE GitLab:
D3.js - https://github.com/d3/d3/wiki
Sunburst - https://observablehq.com/@d3/sunburst
NLPcore algorithms - Corpus Search Methods Patents
Summer 2019 Internship - Node Visualizer
I designed an improved data sorter and visualizer for the data points (legal, medical, etc.) stored on the server. The overall goal of this project was to create an efficient visualizer that could easily be integrated into the existing product.
In the first several weeks, I familiarized myself with pandas DataFrames and the methods that could be used to filter and sort them down to the desired data. As I began to create visualizers for the data being gathered, some constraints needed to be met: scalability (from 10 to 500 nodes), physics modeling, and compatibility with key data points. The first iteration used a library called NetworkX, which drew its graphs with Matplotlib; it had many limitations, however, as seen in the following:
It did not scale and had no physics engine. From here I moved on to Pyvis, which offered several important features such as physics-based modeling and scalability, yet it was not compatible with the data points we hoped to show in the graph. (The Pyvis model is shown on the right.)
I finally settled on the D3.js library, which met all our criteria for an ideal visualizer. We now have several working features, many of which can be easily customized: force modeling, grouped nodes, fisheye zoom, right-click on nodes/links, and data-defined visual aspects of nodes/links (refer to the gallery at the bottom for an image of each feature).
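As an illustration of what "data-defined visual aspects" can mean in practice, the sketch below prepares a node-link structure in the shape commonly used by D3 force-directed examples (a `nodes` list with `id` and `group`, and a `links` list with `source`, `target`, and `value`). The function name `to_force_graph`, the extra `count` field, and the sample data are assumptions for illustration; this is not the project's actual code.

```python
import json
from collections import Counter


def to_force_graph(edges, node_types):
    """Build a D3-style node-link dictionary.

    `edges` is an iterable of (source, target, weight) tuples and
    `node_types` maps each term to its type. Both are hypothetical inputs.
    """
    edges = list(edges)  # allow generators; we iterate twice

    # Count how many links touch each node; a renderer can map this
    # count to node radius.
    counts = Counter()
    for source, target, _weight in edges:
        counts[source] += 1
        counts[target] += 1

    nodes = [
        {"id": term, "group": node_types.get(term, "unknown"), "count": n}
        for term, n in sorted(counts.items())
    ]
    # A renderer can map "value" to link width to show relevance.
    links = [
        {"source": s, "target": t, "value": w} for s, t, w in edges
    ]
    return {"nodes": nodes, "links": links}


# Made-up sample data, for illustration only.
graph = to_force_graph(
    [("Coronavirus", "ACE2", 3.0), ("Coronavirus", "fever", 1.0)],
    {"Coronavirus": "virus", "ACE2": "protein", "fever": "symptom"},
)
payload = json.dumps(graph)  # string that could be handed to the browser
```

On the D3 side, `group` would typically drive color, `count` the node size, and `value` the link thickness, matching the features listed above.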
How the program is run:
Code snippets - the important functions:
Link finder - finds every connection for the desired nodes
Data sorting methods - reduce a large DataFrame to a short list
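The two functions named above can be sketched roughly as follows. This is a hypothetical reconstruction, not the original code: the project worked on pandas DataFrames, while this sketch uses plain lists of dictionaries to stay dependency-free, and the names `find_links` and `top_rows`, the `source`/`target`/`weight` keys, and the rule that a link matches when either endpoint is a desired node are all assumptions.

```python
def find_links(rows, wanted):
    """Return every row that connects to at least one wanted node.

    `rows` is a list of dicts with "source" and "target" keys
    (a hypothetical layout standing in for DataFrame rows).
    """
    wanted = set(wanted)
    return [
        row for row in rows
        if row["source"] in wanted or row["target"] in wanted
    ]


def top_rows(rows, key="weight", limit=10):
    """Reduce a large table to its `limit` highest-scoring rows."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:limit]


# Made-up sample table, for illustration only.
table = [
    {"source": "a", "target": "b", "weight": 5},
    {"source": "b", "target": "c", "weight": 2},
    {"source": "c", "target": "d", "weight": 9},
]
links_for_b = find_links(table, ["b"])   # rows touching node "b"
strongest = top_rows(table, limit=2)     # two highest-weight rows
```

With pandas, the same ideas would usually be expressed with boolean masks and sorting on the DataFrame itself rather than with Python loops.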
If a slide-show style is preferred, here is the link to the Google Slides.
Right-click on link:
Right-click on node:
Varying size of node based on count:
Weighted Links, shows relevance: