For most of us, the word graph brings back memories (not always pleasant) of pencils and rulers and quadrille ruled paper, though a more recent generation may think instead of pie and bar charts produced in Excel from spreadsheets. However, in the last few years, another form of graph, one that goes back to a whole branch of mathematics called graph theory, is beginning to have a huge impact in business, science and the world of artificial intelligence.
A graph database is a class of data store that has been flying under the radar for some time but is increasingly emerging as one of the hottest trends in data management. It makes use of relationships and connections as much as tables and indexes.
Graph representations are not new. Indeed, graphs figured heavily in early medieval mysticism, including the Hebrew Kabbalah.
On the other hand, graphs also play a significant role in contemporary physics, including such areas as the standard model of particle physics.
In the realm of computing, graphs are everywhere. Indeed, it can be argued that, at least for a broad subset of computer programming, graphs underlie almost all databases produced since the introduction of the computer in the 1940s. However, pure graph databases (rather than just ones that can be modeled by graphs) have only been around for the past decade or so, divided primarily into two distinct types: semantic (or triple store) databases, and property graph databases. In order to understand the distinction, a little bit of background on database theory in general is useful.
The Origins of Graph
In a traditional relational database, information is stored in tables, where the rows of each table correspond (roughly) to a particular thing or resource, and the columns correspond to the properties acting on those resources. This distinction is somewhat fuzzy in that tables and classes of objects aren’t completely synonymous but it works to a first approximation. The values where a particular row and column intersect can essentially be of one of three broad types: a scalar value (such as a revenue number or the text contents of an article), a primary key, or a foreign key reference to a primary key.
The primary key is an identifier for the resource itself. Many databases use a mechanism in which every time a new row is created, an integer pointer is incremented and used as the key. Because rows can be deleted or merged, most relational tables do not quite have a one-to-one correspondence between row and primary key. Sometimes it is possible (though not necessarily wise) to bypass creating a primary key altogether, in which case the key for the table is a composite key built from several columns taken together.
A join, in relational parlance, matches a foreign key (or more properly, key reference) in one table with a primary key in another table. By using a particular language syntax (SQL), you can create a join by telling the database what table a given key reference column indicates, then matching the foreign key indexes (the integers) to the primary key indexes in the target table. An Excel user could think of this as creating a lookup table on a separate sheet in a workbook, then using lookup techniques to get detailed data about something without having to repeat the same information over and over in the spreadsheet.
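As a concrete sketch of this (the table and column names here are invented for illustration), a join that matches a foreign key in an orders table to the primary key of a customers table might look like:

```sql
-- Hypothetical schema: each row of orders carries a foreign key
-- (customer_id) referencing the primary key of customers.
SELECT orders.order_id, customers.name
FROM orders
JOIN customers
  ON orders.customer_id = customers.customer_id;
```

The ON clause is precisely the "tell the database what table a given key reference column indicates" step described above.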
When Tim Berners-Lee (TBL) was first mucking about trying to write an abstracts system (what we'd call a content management system today) at CERN in the late 1980s, he came up with a truly ingenious invention. He built a server on a target machine that would take an address (that famous http: protocol) and retrieve a document on that server that was referenceable from that address. Put another way, he mimicked the way that a database stored its references, then used http: as a way of uniquely specifying where that resource was.
On an HTTP client (what would eventually be known as a web browser), Berners-Lee embedded hypertext links that used that same protocol so that, when clicked or otherwise activated, would tell the browser to fetch the content and replace the current display page with the new one. Because the display page used a very general language (HTML), early users of TBL’s browser were able to jump from link to link to link, pulling in everything from physics abstracts to lists of CDs a given student had to (eventually) collections of porn.
The earliest web pages included both an href attribute that specified where a given document was linked to and a separate rel attribute that described the relationship that the linked document had to the current one, in much the same way that a column head describes a database link. However, because there was no real formal use for the rel attribute (especially once automated editors replaced hand editing for HTML), it atrophied except in the HTML link element, where values such as rel="stylesheet" remain in everyday use.
The semantic web, in which TBL also played an outsized role, took advantage of this similarity to describe a database abstractly. The combination of row identifier, column name (or identifier) and cell or link became a vector, with the table corresponding to a class. This relationship of row id (called the subject), column name (the predicate) and the cell value (the object) then became known as a triple, the vector pointing from subject to object (as nodes) connected by an arrow (or edge).
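To make that mapping concrete, a single row of a hypothetical Person table (with id, name, and employer columns) decomposes into one triple per cell; the namespaces below are assumptions for illustration:

```turtle
@prefix person: <http://example.org/person/> .
@prefix org:    <http://example.org/org/> .
@prefix schema: <http://example.org/schema/> .

# Row 123 of the Person table, one triple per column:
person:123 schema:name     "Jane Doe" ;   # a scalar cell value
           schema:worksFor org:Acme .     # a foreign-key style link
```

The row identifier (person:123) is the subject of every triple, each column name becomes a predicate, and each cell value becomes an object.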
This arrangement of triples, with the object of one triple becoming the subject of another, is known in mathematical circles as a graph. There's a very well-established branch of mathematics called graph theory that covers the characteristics of graphs, but one of the most useful aspects of such graphs is that they can describe just about any data structure used in computer science. A linked list is a graph, as is a table, as is a tree. Graphs underlie how 3D meshes are defined, to the extent that a graphics processing unit (or GPU) can be used to optimize graph operations, and they are becoming the preferred means for storing information in the wild, woolly world of AI.
While there are a number of different types of graphs, the last few years have seen the emergence of two distinct subtypes: Semantic graphs (RDFGs), and property graphs (PGs). Semantic graphs are now offered by a number of vendors, including Top Quadrant, Apache (Jena), AllegroGraph, OntoText, MarkLogic, Stardog, Virtuoso, Pool Party, Amazon Neptune, and others. They incorporate a logical model called the Resource Description Framework (or RDF), and are used mostly for inferential analysis.
The second type of graph store, the property graph, generally does not store RDF and differs from semantic graphs in two critical ways: nodes can store attributes (scalar values), and the properties (edges) that connect nodes can also store attributes.
For attributes on subjects or objects this distinction is mostly meaningless — it affects how information is optimized in storage, but RDFGs have an explicit way of expressing literals vs. links that is equivalent to property graph attributes.
Where things differ is in the treatment of predicates. In a semantic graph, it is possible to create a triple where a predicate identifier is itself the subject or object of another triple. This technique is used all the time for annotating or providing schematic information about properties, but in general, what is described is information about the property itself, not about the property relative to a given subject or object. This is consistent with RDF as a language for describing resources, even when those resources are properties.
In a property graph, on the other hand, the predicate can hold attributes that are specific to each predicate, relative to a given subject or object. For instance, if I have a statement:
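The statement itself is not shown here; a plausible version, written as a Turtle triple with hypothetical namespaces, would be:

```turtle
@prefix person: <http://example.org/person/> .
@prefix org:    <http://example.org/org/> .

person:Jane person:isMemberOf org:Vegetarians .
```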
then in a property graph I can assign an attribute to just that person:isMemberOf predicate (but no others) indicating a confidence level of 75%. Now, this is not to say that you can't express this information in RDF — you can do it with blank nodes readily enough — but what is then being described is more complex.
RDF takes the tack of saying that when you say that the property person:isMemberOf has a confidence level of 75%, what you're really doing is making an assertion about an assertion. For instance, the above could be expressed as:
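Using the standard RDF reification vocabulary (the subject, object, and confidence property below are illustrative assumptions), this might read:

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix person: <http://example.org/person/> .
@prefix org:    <http://example.org/org/> .
@prefix ex:     <http://example.org/terms/> .

# A blank node stands in for the statement being described:
[] a rdf:Statement ;
   rdf:subject   person:Jane ;
   rdf:predicate person:isMemberOf ;
   rdf:object    org:Vegetarians ;
   ex:confidence 0.75 .
```

Note that five triples are needed to annotate one: this is the "explosion of triples" that made reification unpopular.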
People who work with semantics call this a reification, which means that it's a statement about a statement. Reification has been an uneasy part of semantics almost from the beginning: it has never been given convenient first-class support in the RDF language (or in SPARQL, the semantic query language) because most proposals caused an explosion of triples, as it was believed that every single assertion needed to be reified. Property graphs get away with this because they rely (in some cases) upon lexical matching of predicates, but that also limits how well they can deal with scaling relationships.
Put another way, property graphs are better at traversal, while semantic graphs are generally more optimized for metadata.
The Emergence of RDF* and SPARQL*
Realistically, though, most graph applications need both capabilities. Graph traversal becomes critical when attempting to determine envelopes: find me all resources within six hops of the current resource, regardless of how they're related, as an example (a query immortalized in the parlor game Six Degrees of Kevin Bacon). Similarly, reification becomes important when talking about the reliability of assertions. There are also many stochastic problems, such as doing Bayesian analysis or minimizing network path lengths (another form of energy envelope minimization), that require the ability to reify content.
Recently the W3C held a symposium on the unification of property graph and RDF graph standards. While a number of very interesting discussions and efforts came out of this, one of the most exciting was a note submitted by Olaf Hartig concerning the unification of RDF and property graphs, one that has now become the cornerstone of a new chapter of RDF and SPARQL.
His proposal was fairly simple, but with far-reaching consequences: treat a triple as a resource. The syntax this takes is straightforward:
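In RDF* the five-triple reification above collapses to a single statement (the namespaces and confidence property are the same illustrative assumptions as before):

```turtle
<< person:Jane person:isMemberOf org:Vegetarians >> ex:confidence 0.75 .
```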
The double angle brackets wrap an assertion and make the whole thing a subject, and the assertion (statement) in turn can then be clarified. This has a number of implications:
Because RDF does not recognize the distinction between two statements that have the same subject, predicate, and object, you can make assertions, then later annotate or clarify them:
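A sketch of what this looks like, using the same hypothetical namespaces (the birthdate and confidence values are invented for illustration, and xsd: is the standard XML Schema datatype namespace):

```turtle
# Base assertions:
person:Jane person:isMemberOf org:Vegetarians .
person:Jane person:birthDate  "1985-04-12"^^xsd:date .

# Later annotations on those same assertions:
<< person:Jane person:isMemberOf org:Vegetarians >> ex:confidence 0.75 .
<< person:Jane person:birthDate  "1985-04-12"^^xsd:date >> ex:confidence 0.95 .
```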
Here, we have two statements, one centered around Jane’s membership in an organization, the other, her birthdate. The reified statements allow us to make annotative assertions about these properties, without the need to create such confidence or quality metrics in the base classes, reducing the underlying complexity of the modeling.
One of the big problems that banks and other financial institutions run into is that bi-temporal information — knowing when a given property became known vs. when it happened — is metadata that is very hard to capture even in relational databases, and surprisingly hard even in semantic graphs. Recognizing that reification is the culprit for this complexity strengthens the case for making it a key part of RDF.
A second implication is that you can use placeholders for assertions:
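For example (the provenance properties and date here are assumptions, matching the John Deere example discussed below in the text):

```turtle
<< person:Jane person:isMemberOf org:Vegetarians >>
    ex:assertedBy person:JohnDeere ;
    ex:assertedOn "2019-06-01"^^xsd:date .
```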
This becomes auditable information. I can query this RDF and find out which statements were explicitly made by John Deere, and use it, in turn, to perform analysis on the verity of those statements.
On the subject of query, Olaf also recommended the same kind of syntax be made available for SPARQL as for RDF. For instance, the following query would retrieve all of the resources and properties by the person who made them, ordered by date:
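A plausible SPARQL* rendering of that query, assuming the hypothetical provenance vocabulary used earlier, is the following; the embedded triple pattern in double angle brackets binds the reified statement's own subject, predicate, and object:

```sparql
PREFIX ex: <http://example.org/terms/>

SELECT ?s ?p ?o ?asserter ?date
WHERE {
  << ?s ?p ?o >> ex:assertedBy ?asserter ;
                 ex:assertedOn ?date .
}
ORDER BY ?date
```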
One impact of this is that with such a system enabled, writing a system that transforms Gremlin/TinkerPop (the predominant property graph language) into SPARQL* should become straightforward, without significantly breaking the RDF or SPARQL models.
A Call for Variable Predicate Paths
There are other things that can be done to bring property graphs and semantics ones in line. One of the key ones is to lift the restrictions on the use of variables within predicate paths. Currently, you can either have a SPARQL variable or a fixed predicate path, but there’s no way to have a predicate path with variable components, something on the order of:
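To be clear, SPARQL 1.1 does not permit this; the following is a sketch of hypothetical syntax for such a query, not anything a current engine will accept:

```sparql
# Hypothetical syntax: a variable inside a property path with a
# {1,5} repetition range, which SPARQL 1.1 does not allow.
SELECT ?p ?o
WHERE {
  person:Jane (?p){1,5} ?o .
}
```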
This would return all predicates and objects that originate with a given subject and are between one and five hops away.
This capability was discussed in the lead-up to SPARQL 1.1, but was not implemented because too many vendors were concerned about performance impacts. However, six years later, servers have become fast enough that this argument has little real merit.
It may very well be that something similar to the above idea, perhaps the generation of a named tree-graph that can be queried for depth or sequencing, could be incorporated into the next generation of SPARQL. What's perhaps as significant is that interest in revisiting these specifications is also rising. Two other recent innovations, the emergence of GraphQL as a language for traversing JSON-based property graphs and the rise of JSON-LD as a de facto standard for JSON expression of RDF, are changing the nature of semantics.
RDF* and SPARQL* may prove, over the next few years, to be the critical linchpin for keeping two closely related technologies from forking irretrievably. By treating reifications, assertions about assertions, as first-class objects in the Semantic Web, the door opens both to interchange between property and semantic graphs and to the ability to work with graphs using a much broader set of tools. Hopefully, recent developments within the W3C and elsewhere point to a détente at a minimum, and perhaps even significant innovation in the field to come.