2015-01-28

Data Patterns, continued

Follow-up of the previous post, still trying to make sense of this pack of untranslatables: pattern vs schema vs structure vs model, and in particular how to draw the fine line between their descriptive and prescriptive aspects ... without further spamming the DC-Architecture list with this discussion with +Holger Knublauch, which has somehow gone astray ...
Looking at pattern in the Wiktionary yields a lot of definitions, among others the following ones, broad enough to fit our purpose.
  • A naturally-occurring or random arrangement of shapes, colours etc. which have a regular or decorative effect. 
  • A particular sequence of events, facts etc. which can be understood, used to predict the future, or seen to have a mathematical, geometric, statistical etc. relationship. 
Further on in the same source, I discover that pattern can also be used as a verb (to pattern):
  • To make or design (anything) by, from, or after, something that serves as a pattern; to copy; to model; to imitate.

To discover, recognize, classify and name patterns in the world is a basic activity of our brain, and the very basis of our knowledge. Are those patterns emerging in our brains and projected on reality? Or does the world really signify something to us (in the sense of the French faire signe) with those patterns, pointing to some internal logic and maybe meaning? I will remain agnostic here on this deep question, and rather look at an example which will bring us back to the question of patterns in data.
What do we see in this image? Objects of various shapes, sizes and colors, connected by edges which are apparently not oriented. Some would call it a graph. Can you see any pattern? A casual look might miss it, and say those shapes, colors and sizes are rather random, their distribution is not really regular, although there are some vertical and horizontal alignments, groups of objects of the same color, and other groups of the same shape. A mix of order and randomness, like in the real world. Looking more closely, you will notice that connected objects share either a common color, or a common shape, or both (like the two red rectangles). This I will call a pattern.
We can now try to describe those objects in RDF data, using three predicates ex:shape, ex:color and ex:connected, and check if the pattern is general.

:blueMoon1  
    ex:shape  "moon";
    ex:color "blue";
    ex:connected  :blueTriangle1 .

:blueTriangle1  
    ex:shape  "triangle";
    ex:color "blue";
    ex:connected  :blueMoon1, :blueEllipse1, :redTriangle1 .

etc.

The pattern can be checked over the above data using this query

SELECT ?x
WHERE 
{
  ?x  ex:shape ?xShape.
  ?x  ex:color ?xColor.
  ?y  ex:shape ?yShape.
  ?y  ex:color ?yColor.
  ?x  ex:connected  ?y.
  FILTER (?xShape = ?yShape || ?xColor = ?yColor)
}

If the pattern holds, this query should yield all objects in the graph. If there is a handful of exceptions out of thousands of objects, I will certainly consider it a general pattern, and look closely at the exceptions for further investigation. If the pattern is observed for, say, 60% of nodes, I will certainly consider it a frequent pattern. If the result is less than 10%, I will tend to consider it a random structure rather than a pattern. All this activity is descriptive, with possible predictive purposes. I might have queried only a part of this graph because it has billions of objects, and assume the pattern extends to the rest.
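
The descriptive check above can also be turned into a rough measure of how frequent the pattern is. The following is only a sketch: the ex: namespace URI is an assumption (it is not defined in this post), and the query counts connected pairs rather than nodes, so a connection stated in both directions is counted twice.

```sparql
PREFIX ex: <http://example.org/ns#>   # hypothetical namespace, not given in the post

# ?pairs    : connected pairs where both ends have a shape and a color
# ?matching : those pairs sharing a shape, a color, or both
SELECT (COUNT(*) AS ?pairs)
       (SUM(IF(?xShape = ?yShape || ?xColor = ?yColor, 1, 0)) AS ?matching)
WHERE
{
  ?x  ex:connected  ?y .
  ?x  ex:shape ?xShape ;
      ex:color ?xColor .
  ?y  ex:shape ?yShape ;
      ex:color ?yColor .
}
```

If ?matching equals ?pairs, the pattern is general over the queried data; otherwise the ratio of the two gives the kind of frequency discussed above, and the difference is the number of exceptions to investigate.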

Can I turn this pattern into a prescriptive rule? Sure enough. If I want to create a new object connected to the yellow triangle at the bottom right, it has to be either a triangle (free color), or a yellow whatever (free shape), or both. But ... may I introduce new colors and new shapes, such as a yellow star or a purple triangle? In an open world, this is not forbidden by my pattern. But my closed system can be more restrictive, and limit the shapes and colors to those already known. 
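
The closed reading of the pattern could itself be written as a query. This is a sketch under assumptions: the ex: namespace is hypothetical, and the lists of allowed shapes and colors only contain the values named in this post, since the full content of the picture is not in the data shown.

```sparql
PREFIX ex: <http://example.org/ns#>   # hypothetical namespace

# Objects introducing a shape or a color outside the closed lists.
SELECT DISTINCT ?x ?shape ?color
WHERE
{
  ?x  ex:shape ?shape ;
      ex:color ?color .
  FILTER ( ?shape NOT IN ("moon", "triangle", "ellipse", "rectangle")
        || ?color NOT IN ("blue", "red", "yellow") )
}
```

An empty result means the data stays within the closed system. Under the open world reading, a yellow star or a purple triangle returned here is not an error, just something new.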

I'm pretty sure that people asked to extend this graph, even after discovering the underlying pattern, will wonder for a while whether or not they are allowed to introduce a yellow star or a purple triangle, because neither star nor purple appears in the current picture. It's likely that the most conformist of us will interpret the open pattern as a closed world schema, where objects can have only the shapes and colors already present. Not to mention the size, which has not been discussed, and is not represented in the data. Imaginative people, and certainly many children, will take the open world assumption to freely invent new shapes with new colors, maybe joyfully breaking the pattern in many places. Logicians will be stuck wondering which logic to use, and are likely to do nothing but argue about it at length with each other.

What lessons do we bring home from this example?
  • Patterns can be discovered in data, or checked over data. 
  • The same observed pattern can be turned into an open world rule or included in a closed world schema, and there is generally no single way to do either of those.
  • We should have a way to represent and expose patterns in data, independently of their further use. The current RDF pile of standards has nothing explicitly designed for such representations, but SPARQL would be a good basis.
  • Patterns are not necessarily linked to types or classes of objects. In our example, no rdf:type is either declared in the data or used in the SPARQL query.
For those who read French, see also the post Le toro bravo et le top model on Mondeca's blog Leçons de Choses, dated April 2010, showing that those ruminations are not really new.

2015-01-26

The case for Data Patterns

The W3C RDF Data Shapes Working Group is having a hard time trying to name the technology it is chartered to deliver. A proposal by +Holger Knublauch for Linked Data Object Model has triggered a lively discussion even outside the W3C group forum, on the Dublin Core list where +Thomas Baker has supported and pushed further my suggestion to use data pattern instead of shape, model or schema in various combinations with linked and object. Since this terminological proposal has over the weekend made its way to the official proposal list, maybe it's time to justify and explain a bit more such a terminological choice, and what I put technically under this notion of pattern.
I must admit I've not gone thoroughly through the long threads of the Shapes WG wondering, among other tricky questions, about resources and resource shapes, or whether shapes are classes. Maybe the view I expose below is naive, but the overall impression I get is that all those efforts to ground the work on RDFS or OWL are just bringing about more confusion on the meaning of already overloaded terms. A parallel discussion started a few days ago on the Semantic Web list from a falsely naive question by +Juan Sequeda on how to explore a SPARQL Endpoint. In this exchange with +Pavel Klinov, I take the position that exploring RDF data is looking for patterns, not for schema.
The terminological distinction is important. The notion of schema, or for that matter the alternative proposal model, is heavily overloaded in the minds of people with a database background, and it is on the other hand totally abused in the RDF world. Its use in the RDFS name itself was a big cause of confusion. Not to mention the more recent http://schema.org, which defines anything but a schema, even in its RDF expression. RDFS vocabularies or OWL ontologies are neither schemas nor models as understood in the closed world of databases or XML, namely global structures which precede and control the creation and/or validation of data. Using the term schema in the RDF landscape is in fact preventing people from grokking that RDF data by design has no need for schema. No schema in an RDF dataset is not a bug, it's a feature. And the current raging debates only show that people put so many different meanings on schema when trying to use it about RDF data that you had better forget using it at all.
Patterns, on the other hand, can be present in data whether or not they have been defined a priori in a global schema or model. They can be observed over a whole dataset or only in parts of the data. They can be used for query, validation, and even making inferences. But they are agnostic about the various interpretations implied by such usages: they don't abide a priori by any closed or open world assumption.
Technically speaking, how can a data pattern be expressed? To anyone a bit familiar with SPARQL, it is formally equivalent to the content of a WHERE clause in a SPARQL query. Such content, by the way, is indeed called a graph pattern by the SPARQL specification itself. Let me take a simple example which will hopefully address en passant an issue expressed by +Karen Coyle, the fact that people (in the Shapes WG) have a hard time thinking about data without types (classes).

Let P1 be the following pattern (prefixes defined as per Linked Open Vocabularies).
{
?x   dcterms:creator  ?y.
?y   person:placeOfBirth ?z.
?z   dbpedia-owl:country  dbpedia:France
}
This pattern does not contain any rdf:type declaration, hence it does not seem like a shape under any of the current definitions proposed by the Shapes WG. It is not attached to, even less defined as, an explicit class. It does not rely on any RDFS or OWL construct.
What is the possible use of such a pattern? A basic level of use would be to declare that it is present or even frequent in the dataset (the description of the use of a pattern in a dataset could provide a COUNT giving the number of its occurrences), which means that if you use it as a WHERE clause in a SPARQL query over the dataset, the result will not be empty and will represent a significant part of the data.
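As a sketch of that basic level of use, the occurrences of P1 could be counted as follows. The prefix URIs are my assumption of what Linked Open Vocabularies gives for these prefixes, and should be checked before running the query.

```sparql
# Prefix URIs below are assumptions, to be checked against Linked Open Vocabularies.
PREFIX dcterms:     <http://purl.org/dc/terms/>
PREFIX person:      <http://www.w3.org/ns/person#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia:     <http://dbpedia.org/resource/>

# ?occurrences : number of matches of P1
# ?things      : number of distinct ?x taking part in at least one match
SELECT (COUNT(*) AS ?occurrences)
       (COUNT(DISTINCT ?x) AS ?things)
WHERE
{
  ?x   dcterms:creator  ?y .
  ?y   person:placeOfBirth ?z .
  ?z   dbpedia-owl:country  dbpedia:France .
}
```

A non-zero count says the pattern is present; whether it is frequent depends on how the count compares with the size of the dataset.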
Another level would be to associate P1 by some logical connector to another pattern, for example let P2 be the following one.
{
?x    dcterms:title  ?title.
 FILTER (lang(?title) = "fr")
}
One can now constrain the dataset by the rule P1 => P2 (supposing here the variable ?x is defined globally over P1 and P2). Said in natural language, if the creator of some thing is born in France, then this thing has a title in French (which might be a silly assumption in general, but can make sense in my dataset about French works). Note again that there is no assumption on the type or class of ?x and ?y. Of course one can fetch the predicates in their respective ontologies using their URIs and look up their rdfs:domain to infer some types. But you don't need to do that to make sense of the above constraint. Practically, this constraint would be validated on all or part of the dataset by the following query yielding an empty result (the variables of P1 are renamed ?p and ?place for readability).
SELECT *
WHERE
{
?x    dcterms:creator  ?p.
?p    person:placeOfBirth ?place.
?place dbpedia-owl:country  dbpedia:France.
FILTER NOT EXISTS
{?x    dcterms:title  ?title.
FILTER (lang(?title) = "fr")}
}
I am not sure how P1 => P2 would be interpreted as an open world subsumption. Supposing you can interpret each of the patterns as some constructed OWL class for the common variable ?x, and write a subsumption axiom between those, I am not sure such an interpretation would be unique. Deriving types from patterns is something natural language and knowledge do all the time, but I am not sure OWL, for example, handles that kind of induction. There is certainly work on this subject I don't know of, but it's clearly not "basic" OWL.
In conclusion, I am not claiming that patterns and SPARQL cover all the needs and requirements of the Data Shapes charter, but I hope this shows at least that searching and validating data based on patterns can be achieved independently of RDFS or OWL constructs, and even of any rdf:type declaration.
Follow-up of the conversation on DC-Architecture list.

[EDITED 2015-01-27] After feedback from Holger and further reading of LDOM, it seems that the above P1 => P2 can be expressed as an LDOM Global Constraint encapsulating the SPARQL query, thus:
ex:MyConstraint
a ldom:GlobalConstraint ;
ldom:message "Things created by someone born in France must have a title in French" ;
ldom:level ldom:Warning ;
ldom:sparql """
SELECT *
WHERE
{
?x    dcterms:creator  ?p.
?p    person:placeOfBirth ?place.
?place dbpedia-owl:country  dbpedia:France.
FILTER NOT EXISTS
{?x    dcterms:title  ?title.
FILTER (lang(?title) = "fr")}
}
""" .