Entanglement

Embarrassingly-scalable graphs

Project Description

Entanglement is an embarrassingly-scalable platform for graph-based data mining and data integration, developed by the integrative biology group at Newcastle University. It allows us to integrate datasets that were intractable with previous technologies.

Background

Bioinformatics and biomedicine have a long history of using graph-based approaches to data integration. These have used a mixture of standard technologies (e.g. RDF, OWL, SQL) and more custom solutions (e.g. ONDEX, InterMine). While graph-based approaches have proven very successful, they tend to run into scalability issues at some point.

At the same time as the bio datasets have been growing, grid and cloud services have been maturing. These essentially remove the hardware scalability issues, allowing ‘scalable by design’ software architectures built on the ubiquitous deployment of disposable virtual machines and on NoSQL databases like MongoDB.

Entanglement has been designed to address this space. Everything about it is designed to support scalability.

Architecture

The Entanglement architecture embraces grid environments, being built from symmetric VMs. Hazelcast and MongoDB provide scalable in-memory and persistent data storage. On top of this is layered a highly performant graph API, capable of managing very large graphs with minimal performance degradation.

Individual Entanglement graphs are packed into MongoDB collections, with documents representing both graph elements (nodes and edges) and the log of operations that built those elements. Several packings are supported, based upon whether the graph is being actively modified or is sealed, how large it is, and the indexing options. This allows Entanglement to scalably handle low-level storage and lookup of individual graphs with very many nodes and edges.
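
To make the packing concrete, here is a minimal Java sketch, using the MongoDB driver, of how a node and its corresponding log entry might be stored. The collection and field names are illustrative assumptions, not Entanglement's actual on-disk schema.

    // Illustrative only: field and collection names are assumptions, not
    // Entanglement's real schema. Shows graph elements and the operation log
    // that built them living side by side in MongoDB collections.
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import java.util.List;

    public class GraphPackingSketch {
        public static void main(String[] args) {
            try (var client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("entanglement_demo");

                // One graph's elements and its operation log, each in a collection.
                MongoCollection<Document> nodes = db.getCollection("uniprot_graph.nodes");
                MongoCollection<Document> log   = db.getCollection("uniprot_graph.oplog");

                // A node document: a keyset of identifying values plus content.
                Document node = new Document("type", "Protein")
                        .append("keys", List.of("uniprot:P04637", "hgnc:TP53"))
                        .append("content", new Document("name", "Cellular tumor antigen p53"));
                nodes.insertOne(node);

                // The operation that created the node is logged alongside it.
                log.insertOne(new Document("op", "createNode")
                        .append("keys", node.get("keys"))
                        .append("timestamp", System.currentTimeMillis()));
            }
        }
    }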

Graph Data model philosophy

The key principles of the Entanglement data model are to embrace: multiple identity, integration over aggregation, missing or incomplete data, messy data blobs, and partial data processing.

Multiple Identity: Entity identity is one of the key issues in data integration. Within a tightly-controlled data model, entities are assigned identity, for example, as a database primary key. However, when integrating across multiple data models, single entities will typically have many identifying keys. Entanglement embraces this by associating each node and edge with a keyset. This keyset is a collection of uniquely-identifying data for that node or edge. This may include internet-unique URIs, domain-specific identifiers or accession numbers, co-ordinates, or any other data fields that provide this datum with an identity. Two keysets match if any one of their identifying keys matches. Two nodes or two edges with matching keysets can be merged, and edges refer to linked nodes by matching keysets.
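
The following is a minimal Java sketch of the keyset idea; the Keyset type and its methods are invented for illustration and are not Entanglement's actual API.

    // A sketch of keysets: match on any shared identifying key, merge by union.
    import java.util.HashSet;
    import java.util.Set;

    public class KeysetSketch {

        /** A collection of uniquely-identifying values for one node or edge. */
        record Keyset(Set<String> keys) {

            /** Two keysets match if any one of their identifying keys is shared. */
            boolean matches(Keyset other) {
                return keys.stream().anyMatch(other.keys()::contains);
            }

            /** Merging two matching keysets simply unions their identifying keys. */
            Keyset merge(Keyset other) {
                Set<String> union = new HashSet<>(keys);
                union.addAll(other.keys());
                return new Keyset(union);
            }
        }

        public static void main(String[] args) {
            Keyset fromUniprot = new Keyset(Set.of("uniprot:P04637"));
            Keyset fromHgnc    = new Keyset(Set.of("hgnc:TP53", "uniprot:P04637"));

            if (fromUniprot.matches(fromHgnc)) {
                // The merged keyset now identifies the node by both accessions.
                System.out.println(fromUniprot.merge(fromHgnc).keys());
            }
        }
    }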

Integration over Aggregation: Legacy data-integration and data-warehousing platforms have a tendency to push the domain modeller towards early aggregation, pulling multiple data sets into a single schema and data store early. Entanglement takes the opposite approach, encouraging aggregation to be deferred for as long as possible. Best practice is to import each data set into its own graph, representing only the data in that data set, and then to produce integrated graphs for ad-hoc querying. Integrated graphs are only materialised as aggregated graphs for export, or when down-stream processing requires these materialised views for performance reasons. The graph integration process is extremely light-weight, allowing clients to include or exclude individual data source graphs on a whim.
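
As a hypothetical illustration of how light-weight such integration can be, an integrated view can be thought of as little more than a named list of member graphs; the type and graph names below are invented, not Entanglement's API.

    // Deferred integration: include or exclude source graphs per view,
    // without copying any data. Purely illustrative.
    import java.util.List;

    public class IntegrationSketch {

        /** An integrated view is nothing more than the set of graphs it spans. */
        record IntegratedView(String name, List<String> memberGraphs) {}

        public static void main(String[] args) {
            // Each data set lives in its own graph, imported independently.
            IntegratedView forCuration = new IntegratedView("curation",
                    List.of("uniprot", "chembl", "internal-annotations"));

            // A lighter view for a quick query simply leaves a source out.
            IntegratedView publicOnly = new IntegratedView("public-only",
                    List.of("uniprot", "chembl"));

            System.out.println(forCuration.memberGraphs());
            System.out.println(publicOnly.memberGraphs());
        }
    }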

Missing or Incomplete Data: Legacy data-integration systems typically require all data referred to by the warehouse to be present in the warehouse. Entanglement allows a graph to refer to any node or edge by a matching keyset, regardless of whether it is present in that graph. Even when edges refer to nodes not present in their graphs (dangling edges), it is often possible to answer complex queries by finding other edges that refer to matching keysets, allowing graphs to work with missing data. When graphs are integrated, some previously dangling edges may now resolve to known nodes. Alternatively, they may match keysets that provide additional identifying keys, allowing transitive keyset matching to collapse the graph down further. By embracing missing data in this manner, many expensive graph data-integrity checks can be postponed, further enabling high-performance import operations.
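
The collapsing effect of transitive keyset matching can be sketched with a plain union-find over identifying keys. This is only an illustration of the idea, not Entanglement's implementation.

    // Keysets that overlap directly, or via a chain of shared keys, end up in
    // one group, i.e. one logical node in the integrated view.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class KeysetCollapseSketch {

        public static void main(String[] args) {
            // Keysets from three source graphs describing (it turns out) one entity.
            List<Set<String>> keysets = List.of(
                    Set.of("uniprot:P04637"),
                    Set.of("uniprot:P04637", "hgnc:TP53"),
                    Set.of("hgnc:TP53", "ensembl:ENSG00000141510"));

            // Map each identifying key to the first keyset that used it; union on reuse.
            int[] parent = new int[keysets.size()];
            for (int i = 0; i < parent.length; i++) parent[i] = i;
            Map<String, Integer> seen = new HashMap<>();
            for (int i = 0; i < keysets.size(); i++) {
                for (String key : keysets.get(i)) {
                    Integer j = seen.putIfAbsent(key, i);
                    if (j != null) union(parent, i, j);
                }
            }

            // All three keysets resolve to the same logical node (group 0).
            for (int i = 0; i < keysets.size(); i++) {
                System.out.println(keysets.get(i) + " -> group " + find(parent, i));
            }
        }

        static int find(int[] parent, int x) {
            return parent[x] == x ? x : (parent[x] = find(parent, parent[x]));
        }

        static void union(int[] parent, int a, int b) {
            parent[find(parent, a)] = find(parent, b);
        }
    }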

Messy Data Blobs: Bioinformatics data is often semi-structured. For many applications, it is sufficient to package up this semi-structured data in a semi-opaque blob, and just link it to related data blobs. Unlike RDF, where all data must be decomposed into triples to be visible to tools, Entanglement encourages data importers to keep the blob-like structure of the data. Both nodes and edges can be full JSON documents, with nested structure that does not take part in the graph topology but can still be used to filter entities.
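
A minimal sketch of such a blob-carrying node, again using the MongoDB Java driver with invented field names: the nested content stays opaque to the graph topology, yet a filter on a nested field still works.

    // Illustrative only: a node carrying a semi-structured blob, plus a query
    // that filters nodes on a nested field inside that blob.
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    import java.util.List;

    public class BlobNodeSketch {
        public static void main(String[] args) {
            try (var client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> nodes =
                        client.getDatabase("entanglement_demo").getCollection("uniprot_graph.nodes");

                // The blob keeps the record's semi-structured shape instead of
                // decomposing it into triples.
                nodes.insertOne(new Document("type", "Protein")
                        .append("keys", List.of("uniprot:P04637"))
                        .append("content", new Document("organism",
                                new Document("name", "Homo sapiens").append("taxonId", 9606))
                                .append("sequenceLength", 393)));

                // The nested structure plays no part in topology, but filters still work.
                for (Document d : nodes.find(Filters.eq("content.organism.taxonId", 9606))) {
                    System.out.println(d.get("keys"));
                }
            }
        }
    }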

Partial Data Processing: Entanglement encourages data import to do the minimal work needed to get entities into a graph, identified, and linked via key relationships. Domain- and application-specific processing can post-process these blobs and build new graphs containing additional edges between nodes, or decompose a node into more complex structures as needed. By placing the results of this additional processing into their own graphs, it is possible for applications to choose the level of detail they require for a given kind of query, by including or not including these finer-grained graphs in their integrated view. This goes a long way towards solving some of the scalability issues inherent in legacy graph-based solutions, where the granularity of the schema must be chosen up-front, and will always be either too fine or too coarse for any particular application.

Scalability

Everything about Entanglement is focussed upon scalability.

  • Scalable storage: the data is sharded across a MongoDB cluster, giving arbitrary data storage scalability. You can never run out of disk space, and store/retrieve scales linearly with graph size.
  • Scalable compute: graphs can be populated in parallel on multiple worker nodes, enabling large jobs to be farmed out over local CPU farms and commodity compute providers. If your problem is big, throw more CPUs at it.
  • Scalable scenarios: the graph data structures themselves support git-style fork-and-merge semantics, drastically reducing the costs of ‘what-if’ scenario planning. Want to try a thousand scenarios? No problem! Want to combine the best three? Just merge the graphs.
  • Scalable data structures: the graph API uses structure-sharing, persistent data structures, giving unlimited undo-redo, and the ability to make very many similar graphs at almost no extra cost.
  • Scalable semantics: all graph updates are captured in a log. These updates have well-defined operational semantics that allow us to compile them down to the most efficient form possible (a sketch follows this list). No more need to tune how your application builds graphs to get the best performance out.
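
As a sketch of the operation-log idea in the last bullet, a run of logged updates with known semantics can be compiled down so that redundant writes never reach storage. The types here are invented for illustration, not Entanglement's real operation classes.

    // Compile an operation log: later writes to the same (node, property) pair
    // supersede earlier ones, so only the effective operations remain.
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class OpLogSketch {

        /** One logged update: set a property on the node identified by nodeKey. */
        record SetProperty(String nodeKey, String property, Object value) {}

        static List<SetProperty> compile(List<SetProperty> log) {
            Map<String, SetProperty> latest = new LinkedHashMap<>();
            for (SetProperty op : log) {
                latest.put(op.nodeKey() + "\u0000" + op.property(), op);
            }
            return List.copyOf(latest.values());
        }

        public static void main(String[] args) {
            List<SetProperty> log = List.of(
                    new SetProperty("uniprot:P04637", "status", "imported"),
                    new SetProperty("uniprot:P04637", "status", "curated"),
                    new SetProperty("hgnc:TP53", "symbol", "TP53"));

            // Three logged operations compile down to two effective writes.
            System.out.println(compile(log));
        }
    }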

Distributed

All operations are designed to be distributed. Any number of users and software agents can interact with an Entanglement session. This supports real-time, collaborative data integration and data mining, in a way not supported by any other system.

  • Distributed querying: a single application-level query may be broken down into pieces that are answered in parallel by multiple servers.
  • Distributed data import: many software agents in multiple locations can collaboratively build graphs or collections of graphs (a sketch follows this list). This allows the often-expensive overhead of data parsing and cleaning to be off-loaded from the database hosts and end-user machines.
  • Distributed data mining: many bots and humans can mine the same graph or integrated collection of graphs, looking for patterns, calculating summary statistics, or performing application-domain specific reporting.
  • Distributed visualisation: data selections and points-of-interest are shared between all users in a session, providing a collaborative space for data mining and visualisation. As one user moves about a large graph, the visualisations for other users in the session can track this. As queries flag portions of a graph as interesting, all users in the session are notified, and their local visualisations can be updated accordingly. Each local visualisation can be customised to view a different subset of the data, or to render it in one or more different ways, supporting an experience that is at once collaborative and personalised.
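
The sketch below is a hypothetical illustration of collaborative import, not Entanglement's actual mechanism: it only shows how a shared Hazelcast structure could let several agents feed updates into one session while a single writer applies them.

    // Two "agents" (threads here; separate machines in practice) submit
    // node-creation operations to a shared, cluster-wide queue.
    import com.hazelcast.collection.IQueue;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class CollaborativeImportSketch {
        public static void main(String[] args) throws InterruptedException {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IQueue<String> pendingOps = hz.getQueue("session-42.pending-ops");

            // Each agent parses its own slice of a data set and submits operations.
            Runnable agent = () -> pendingOps.offer(
                    "createNode keys=[uniprot:" + Thread.currentThread().getName() + "]");
            Thread a = new Thread(agent, "P04637");
            Thread b = new Thread(agent, "P38398");
            a.start(); b.start(); a.join(); b.join();

            // A writer drains the shared queue and applies the operations to the graph.
            String op;
            while ((op = pendingOps.poll()) != null) {
                System.out.println("applying: " + op);
            }

            hz.shutdown();
        }
    }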

Use cases

Entanglement is being used in a number of projects within the int-bio group. Each project is different, which keeps Entanglement from becoming too specialised for a single kind of data or domain.

Use case 1: ARIES

The ARIES project integrates many different data sets, including some which are publicly available, and others that are restricted-access medical data. Entanglement allows us to parse each of these data sets into its own graph, and then expose them to individual clients based upon their security access. The on-the-fly data integration capabilities of Entanglement allow each user to integrate over all of the data that they have access to, while keeping them from seeing that which they do not. The larger integrated ARIES graphs contain billions of nodes and edges, and so far Entanglement is supporting import, querying and export operations in linear time, even for the largest integrated graphs.

Use case 2: Drug Discovery / Repurposing

One approach taken by the Drug Repurposing Project is to look for semantic network structures indicative of repurposing opportunities. This involves importing a range of data sources that may contribute knowledge relevant to the task, finding semantic network structures (networks of labelled nodes and edges) that appear to be associated with known drug repurposing examples, and then searching for other examples of these network structures as potential repurposing opportunities. Previously, we used ONDEX to perform these analyses, but as the size and number of the underlying data sets increased, scalability became an issue. Using ONDEX-to-Entanglement wrappers, we are able to use the existing ONDEX parsers to populate Entanglement graphs, and then perform the graph integration, network structure mining, and searching in Entanglement. This gives us all the benefits of Entanglement, while leveraging the significant effort put into the ONDEX parsers.

Entanglement is developed as part of the ARIES project, funded by the BBSRC.