The end of code – long live data

Training software like a dog

The science of Artificial Intelligence (AI) analyses past patterns in order to predict future occurrences. Machine learning algorithms, a core element of AI, continuously receive input data and, using various techniques, arrive at predictive conclusions that they update on an ongoing basis.

Importantly, predictive accuracy increases not because of the way the software is written, but because it is ‘trained’ to be effective. For instance, an algorithm is taught to recognise a certain real-life object by being shown a plethora of images of that object. Similarly, a social media platform learns a user's preferences and predicts what that user would like to see in the news feed. Software is trained by feeding data into it. Lots and lots of data. Broadly speaking, the more data it is exposed to, the higher its AI IQ.
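
For readers who like to see the idea in action, here is a minimal sketch in Python of what ‘training’ means. It is purely illustrative and rests on assumptions that are not in the sources discussed here: the data is synthetic (two clusters of numbers, centred on 0 and 4), and the helper names `make_samples`, `train` and `accuracy` are hypothetical. The point is only that the decision rule is not written by the programmer; it is estimated from labelled examples, and the estimate tends to settle down as more examples are supplied.

```python
import random


def make_samples(n, rng):
    """Draw n labelled points (synthetic data, an assumption of this sketch).

    Label 0 clusters around 0.0 and label 1 clusters around 4.0.
    """
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        centre = 4.0 if label == 1 else 0.0
        samples.append((rng.gauss(centre, 1.0), label))
    return samples


def train(samples):
    """'Train' by averaging: the threshold is the midpoint of the two class means."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, samples):
    """Share of samples that the rule 'x > threshold means label 1' gets right."""
    correct = sum(1 for x, label in samples if (x > threshold) == (label == 1))
    return correct / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
    test_set = make_samples(2000, rng)
    for n in (2, 20, 200, 2000):
        threshold = train(make_samples(n, rng))
        print(n, round(threshold, 3), round(accuracy(threshold, test_set), 3))
```

Running the script prints, for each training set size, the learned threshold and its accuracy on a held-out test set. With only a couple of examples the threshold depends heavily on which points happened to be drawn; with thousands it should sit close to 2.0, the midpoint between the two clusters. Nothing in `train` was rewritten along the way; only the amount of data changed.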

A couple of years ago Wired magazine published an article that puts it unequivocally:
“Our machines are starting to speak a different language now, one that even the best coders can't fully understand. […] With machine learning, the engineer never knows precisely how the computer accomplishes its tasks. The neural network's operations are largely opaque and inscrutable. It is, in other words, a black box. […] If in the old view programmers were like gods, authoring the laws that govern computer systems, now they're like parents or dog trainers. And as any parent or dog owner can tell you, that is a much more mysterious relationship to find yourself in.”
Data is BETTER THAN the new oil

Unsurprisingly, data has widely been referred to as “the new oil”. However, Bernard Marr (Forbes) disagrees, because data trumps oil in so many ways: it is in effect infinitely durable and reusable; it can be replicated indefinitely and moved around the world instantly at very low cost; it becomes more useful the more it is used; and, once processed, it often reveals further applications. The variety of data is also endless: it can represent words, pictures, sounds, ideas, facts, measurements, statistics or anything else that can be processed by computers into the strings of 1s and 0s that make up digital information.

[Image: Fascinated by the new horizons]
Data has indeed become a valuable commodity, which may be owned, sold, licensed or shared in various formats. Numerous commercial datasets are available on the market, from weather and marketing information to house price fluctuation data. Nevertheless, the majority of commercial solutions relate to data governance and data modelling rather than raw data. Another common theme in the data business is data pooling, for example through a data co-op.

Furthermore, free datasets are available through community-based projects such as Kaggle or FiveThirtyEight, organisations like UNICEF or the World Health Organization, and governments (see the UK's Find Open Data, Ireland’s Open Data Portal, or the European Data Portal).

Data ownership considerations

Discussions around proprietary rights in data are as complex and diverse as the phenomenon of data itself. Can data be owned at all? Who owns it? What is the scope of the ownership monopoly? What legal mechanism would be best suited to protect data ownership? And what implications might personal data protection and competition/antitrust laws have for a data ownership regime?

It may be considered prejudicial to the public interest to grant a monopoly over control of data to a specific entity or individual, thereby depriving (i) users of free access to the ‘building blocks’ of knowledge and innovation, (ii) data itself of the chance to be enriched and improved, and ultimately (iii) society of ‘the progress of science and useful arts’. Against this view, one can argue that protecting data producers' interests and creating an economic incentive for data production are equally justifiable goals.

The entire debate boils down to the categorisation of data. On the one hand, there is machine-generated or industrial data, which involves (usually wholly automated) measurements. On the other, there is processed data that results from skilled analysis, such as inferences about a person’s online behaviour.

Copyright law is clear – no protection is available for raw facts or discoveries, only for an original expression thereof and only to that extent. However, Teresa Scassa argues that the concept of fact should not be conflated with data, and that data should be offered more robust protection than facts:

“often overlooked feature of data is their non-neutrality […], [d]ata inherently reflects choices – choices about which data to collect (or to exclude) and what tools or parameters will be used in their collection. […] These choices reflect the human agency present in the creation of data.”
Notably, in a few instances the courts have concluded that certain data might overcome the threshold of copyrightability. In New York Mercantile Exchange, the US Court of Appeals for the Second Circuit determined that settlement prices continuously generated by a computer algorithm were sufficiently creative, but failed to attract copyright protection due to the merger doctrine. More recently, a similar conclusion was reached by the US District Court in BanxCorp v Costco.

To address the shortcomings of copyright law, it has become prevalent business practice to seek contractual protection of data as confidential information. To be on the safe side, most commercial agreements label data as proprietary to the data originator and then specify that any proprietary materials are confidential information, and hence subject to restrictions on use and dissemination. In contrast to copyright, confidentiality provides strong protection for data in both scope and duration. However, this legal vehicle may not serve all types of data equally well.

The European Union has demonstrated a preference for a sui generis route to data protection. The Database Directive introduced a new database right, the effectiveness of which, however, continues to be questioned.

In 2017, the European Commission published a working document and subsequently a position paper on the proposal for a new right in machine-generated and non-personal data that is not arranged into a database – the so-called data producer’s right. It would encompass:
“the exclusive right to utilise certain data, including the right to licence its usage. This would include a set of rights enforceable against any party independent of contractual relations thus preventing further use of data by third parties who have no right to use the data, including the right to claim damages for unauthorised access to and use of data.”
Professor Bernt Hugenholtz is not that optimistic about how such a right might fit into the IPR realm:
“[I]ntroducing such an all-encompassing property right in data would seriously compromise the system of intellectual property law that currently exists in Europe. It would also contravene fundamental freedoms enshrined in the European Convention on Human Rights and the EU Charter, distort freedom of competition and freedom of services in the EU, restrict scientific freedoms and generally undercut the promise of big data for European economy and society. In sum, it would be a very bad idea.”
Free as in freedom?

This Kat finds it curious to draw parallels between open source software and open data. The complexity of software development has turned it into a community-based collaborative endeavour. As open source guru Eben Moglen points out:
“There’s no organization in the world that could manage eleven million lines of code generated over eight months. There’s no possible way that we could understand how machine learning systems work, given that their architects can’t understand them because all they do is they throw neural nets together in a cookbook, and stuff comes out. There’s no way that all these projects could manage that reduction of fragmentation that you’re talking about unless they could all see one another’s code and understand what the common denominators were in their technical approaches. We’d lose our minds if we didn’t have open source.”
Like software, data is capable of being improved and modified by its users, and copyleft-like licensing of such modifications is no longer a distant utopia. The Linux Foundation has moved in that direction by introducing two Community Data License Agreements, the intention of which “is for contributors and consumers of open datasets to actively use and support the contribution of data in a uniform fashion, while clarifying the terms of that sharing and reducing risk”. The Sharing License encourages contributions of data back to the data community, whereas the Permissive License imposes no such sharing requirement and allows anyone to use, modify and do what they want with the data.

Image Credits: Water&Politics and fourth waveby
Reviewed by Ieva Giedrimaite on Tuesday, April 30, 2019


  1. Based on the statements here, that data is better than oil, I wonder if data is a Giffen Good? These are unicorns of economics - commodity goods (i.e., not luxury goods) where consumption rises as the price rises. There aren't very many (or any?) perfect examples of Giffen Goods, but in this case, the value of data increases as the amount of data increases. It seems to fit....

  2. You mean "FiveThirtyEight", not "ThirtyFiveEight"


