This year, CIPA Congress is tackling all things AI. Together with Ben Hoyle (Hoyle IP Services Ltd), Coreena Brinck (Two IP) and Julio Fonseca (ASML), this Kat has the pleasure of speaking on a panel at Congress focused on the intersection between data, IP and AI, "Data is the new oil", chaired by Greg Corcoran (Greg Corcoran IP). For the avoidance of doubt, the following are this Kat's own views and do not represent the views of the rest of the panel.
The concept of data as oil has been around for a number of years, but does the analogy still hold? Coming from the perspective of the pharma and biotech industry, this Kat sees that there is now a shift away from thinking of data as a bulk commodity of raw material, towards the pursuit of high quality data that can improve the performance of AI models. This shift has important implications for IP strategy and licensing provisions relating to data.
Data was the new oil
Data is a very broad term. At a very basic level, data can be defined as a collection of facts, figures, and statistics. As the Google Books Ngram Viewer reveals, the term "data" only really came to prominence in the 1990s, followed by a steep rise in usage up to the early years of the millennium. The increased use of the term data correlated with the vast increase in information that became available to society from the 90s onwards, brought about partly by the internet and partly by technological advances in computing and scientific discovery (such as genetic sequencing technologies).
The concept of data as "the new oil" was brought to popular attention by a headline in The Economist back in 2017 proclaiming that "The world's most valuable resource is no longer oil, but data". Interestingly, since then, much of the discussion about data and its use in the economy has focused on the negative connotations of that phrase, particularly the exploitation of personal information by big tech. However, for the science and technology sector that the IP industry serves, the analogy of data as oil holds a different meaning. The explosion in the scale and complexity of information that arose around the turn of the millennium gave rise to its own field of big data (which became a buzzword around 2007) and associated specialisms devoted to analysing and processing this data (anyone remember systems biology?).
Importantly, this big data provided an essential bedrock for the AI systems which were to follow. The modern field of machine learning is fundamentally dependent on vast amounts of data to learn, improve, and make accurate predictions. There would be no Nobel Prize-winning AlphaFold without the publicly available Protein Data Bank of protein structures (IPKat), and there would be no Large Language Models (LLMs) without the vast quantity of language data available on the internet. These huge repositories of information were essential to train the models that are the foundation of modern AI.
[Image: From oil to gemstones]
The different types of data
The training data that has fuelled the field of AI takes multiple forms, spanning numerous formats and scientific domains depending on the type of AI. For early visual AI models like Convolutional Neural Networks (CNNs), the primary data consists of images. This could include, for example, medical imagery such as breast scans and MRI scans used for detecting cancer, as well as visual, infrared, and radar images collected from aerial surveillance for applications like disaster management. Importantly, AI models used in scientific discovery rely on vast datasets such as genomic and proteomic information from gene sequencing, protein structures from the Protein Data Bank, and simulation results from semiconductor chip design. For generative AI like LLMs, the training data is predominantly text and code, often scraped from the public internet.
Beyond the initial training data, another crucial category of data in machine learning revolves around the performance and refinement of the AI models themselves. This includes model optimization data, such as the human feedback gathered during techniques like Reinforcement Learning from Human Feedback (RLHF). In this process, human reviewers rank AI-generated responses or identify errors, and this feedback is then used as data to update and fine-tune the model. The output of the models also constitutes a key data type, which can range from a ranked list of documents generated by a prior art search tool to the predicted 3D structures of proteins produced by AlphaFold. Finally, statistics about a model's effectiveness are a critical form of data used to evaluate performance. For instance, research comparing the accuracy of machine learning systems against human doctors provides quantitative data on the model's utility.
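To make these categories a little more concrete, the short Python sketch below (with entirely hypothetical field names and values, not taken from any particular system) illustrates what a single RLHF preference record and a simple model-versus-expert evaluation statistic might look like.

```python
# A minimal, illustrative sketch of two of the data types described above:
# an RLHF preference record and a simple model-vs-expert evaluation statistic.
# All names and values are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One unit of RLHF feedback: a human reviewer ranks two model responses to the same prompt."""
    prompt: str
    chosen_response: str    # the response the reviewer preferred
    rejected_response: str  # the response the reviewer ranked lower


@dataclass
class EvaluationRecord:
    """One model prediction paired with an expert reference label."""
    case_id: str
    model_prediction: str
    expert_label: str


def accuracy(records: List[EvaluationRecord]) -> float:
    """Fraction of cases where the model agrees with the expert reference."""
    if not records:
        return 0.0
    return sum(r.model_prediction == r.expert_label for r in records) / len(records)


# Example: two evaluation records, one of which is correct, give 50% accuracy.
records = [
    EvaluationRecord("scan-001", "malignant", "malignant"),
    EvaluationRecord("scan-002", "benign", "malignant"),
]
print(f"Model accuracy vs. expert labels: {accuracy(records):.0%}")
```

In a real pipeline, preference pairs of this kind would typically be used to train a reward model, but the point here is simply that the human feedback and the evaluation statistics are themselves data, distinct from the original training corpus.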
Searching for the data gemstones
We can all accept that big data has been the oil that has fuelled the engine of AI. In the same way an engine cannot run without fuel, AI models are powerless without vast quantities of information to train on. However, the analogy begins to break down when we consider quality. Unlike crude oil, which is a relatively uniform commodity, there is a vast and often unappreciated difference in the type and quality of data available.
However, the enduring popularity of the "data as oil" analogy feeds a persistent misconception that all data is valuable. It is true that, in the early days of big data, the focus was predominantly on volume. However, we now have so much data that a lot of it is not only worthless but can actively harm an AI model by introducing irrelevant noise and exacerbating biases. Just one example is the patent data used to train LLMs (IPKat). Google Patents contains a huge number of badly written patent applications (and granted patents). As a result, training an LLM on more of this data is unlikely to improve its performance for drafting patent applications.
Consequently, AI software developers are refocusing their efforts on mining high-quality data. Cleaner and more relevant datasets are becoming increasingly valuable, and attention is turning to how they can be obtained. In other words, we are now looking for the data gemstones. A gemstone is rare, precisely formed, and valuable. For an AI model for predicting tumours, a data gemstone might take the form of a curated medical dataset in which thousands of MRI scans have been meticulously annotated by multiple expert radiologists to identify tumours. For companies developing self-driving cars, valuable data is no longer another million miles of uneventful motorway driving. Instead, it is the rare video footage of a near-miss accident, a complex intersection in heavy rain, or a child running into the road. These high-quality, often rare, data points are the key to improving the performance of AI models so that they can perform more expert tasks, with greater accuracy.
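By way of illustration only, the following Python sketch (with hypothetical scan identifiers and labels) shows one way such a curated dataset might be assembled: independent annotations from several radiologists are retained only where the experts agree, and ambiguous cases are discarded.

```python
# An illustrative sketch of "gemstone" curation: keep only scans on which multiple
# expert annotators agree, discarding noisy or ambiguous examples. Hypothetical data.
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AnnotatedScan:
    scan_id: str
    labels: List[str]  # independent labels from several radiologists


def consensus_label(scan: AnnotatedScan, min_agreement: float = 1.0) -> Optional[str]:
    """Return the majority label only if enough annotators agree; otherwise return None."""
    label, count = Counter(scan.labels).most_common(1)[0]
    return label if count / len(scan.labels) >= min_agreement else None


def curate(scans: List[AnnotatedScan]) -> Dict[str, str]:
    """Build the curated dataset (scan_id -> consensus label), dropping ambiguous cases."""
    curated = {}
    for scan in scans:
        label = consensus_label(scan)
        if label is not None:
            curated[scan.scan_id] = label
    return curated


scans = [
    AnnotatedScan("mri-101", ["tumour", "tumour", "tumour"]),     # unanimous: kept
    AnnotatedScan("mri-102", ["tumour", "no tumour", "tumour"]),  # disagreement: discarded
]
print(curate(scans))  # {'mri-101': 'tumour'}
```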
The shift from data as oil to searching for the data gemstones is just as applicable to data relating to model optimization and performance, where the focus is likewise moving towards identifying the rare, high-value pieces of performance data that can improve the quality of a model's output. In the context of LLMs, for example, hallucinations in highly technical areas are unlikely to be ironed out with a sledgehammer approach using more of the same generalist data on which the model was trained. Instead, the focus is now on expert annotation and optimization using Reinforcement Learning from Human Feedback (RLHF) (IPKat), and on developing methods whereby errors and hallucinations generated by the models can be identified.
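For illustration (again with hypothetical questions and answers, and not any lab's actual pipeline), a minimal sketch of such error identification might look like this: the model's outputs in a technical domain are checked against a small expert-verified reference set, and only the contradictions are retained as high-value feedback data.

```python
# Illustrative sketch: flag model answers that contradict an expert-verified reference set.
# The flagged errors are the rare, high-value feedback data described above. Hypothetical data.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelClaim:
    question: str
    model_answer: str


def flag_hallucinations(claims: List[ModelClaim], reference: Dict[str, str]) -> List[ModelClaim]:
    """Keep only claims that contradict the expert-verified reference answers."""
    return [
        claim for claim in claims
        if reference.get(claim.question) is not None
        and claim.model_answer != reference[claim.question]
    ]


# Example: the reference records that inventive step is governed by Article 56 EPC,
# so a model answer of "Article 54 EPC" (novelty) is flagged for expert review.
reference = {"Which EPC Article governs inventive step?": "Article 56 EPC"}
claims = [ModelClaim("Which EPC Article governs inventive step?", "Article 54 EPC")]
print(flag_hallucinations(claims, reference))
```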
Implications for IP strategy
The shift from data as a bulk commodity to curated gemstones of data has important implications for IP strategy. The value of data currently lies not in the raw information itself, but in the intellectual effort, investment, and expertise applied to curate, annotate, and structure the data. This is also where protectable IP can and does reside.
The forms of IP most relevant to data are database rights, copyright, and trade secrets. In the UK and EU, sui generis database rights protect the substantial investment made in obtaining, verifying, or presenting the contents of a database (IPKat). Database rights can therefore protect the investment put into creating valuable data, for example, by funding an extensive clinical trial and analysing the results with bioinformatic techniques, or by employing experts to annotate thousands of images.
Additionally, in many cases, the most powerful protection lies in treating these high-value datasets as trade secrets. A curated dataset derives its economic value from not being generally known and can be subject to reasonable steps to keep it secret. This is not an entirely new concept. The pharmaceutical industry, for example, has long treated its high-quality clinical trial data, compound libraries, and proprietary assay results as fiercely protected crown jewels.
However, a critical point for businesses to grasp is that the act of identifying, cleaning, annotating, and structuring data for AI model training is a significant value-add activity in and of itself. Many non-tech companies, including pharma companies, may be sitting on vast repositories of raw data but lack the internal expertise to refine that data so as to make it useful for training an AI model. This disconnect creates a significant risk that a company will undervalue its assets in collaborations with tech partners. On the flip side, it may also be why an early partnership with an AI company is necessary to extract the true value from the data.
Implications for licensing
From an IP licensing perspective, it is important to capture the distinction between raw, low-quality data and processed, annotated data in licence and collaboration agreements. A standard data licence that grants broad rights to a dataset for a nominal fee is no longer fit for purpose. Agreements must now be far more sophisticated and use precise definitions to clearly delineate what constitutes the licensed curated data as distinct from any raw inputs. Furthermore, contractual terms must address the outputs and applications of the data. This includes asserting ownership of, or at least rights of access to, any improvements, models, or insights derived from the licensed data. Concurrently, strict field-of-use restrictions are essential to limit the licensee's application of the data to a specific purpose, preventing them from using the asset to train other models that could compete with the licensor's core business. Finally, the valuation of the licence must be recalibrated. The licence fees, royalties, or equity stake should reflect the true value of the curated data as a critical enabling asset, not merely as a cost-of-goods-sold commodity.
Final thoughts
As we will discuss on the Congress panel, the phrase "data is the new oil" remains relevant to our understanding of how today's AI models were developed. However, in a supply-led economy, the sheer amount of data that we now have, together with the fact that most of it is of low quality, has decreased the value of generic data as a bulk commodity that can be mined from the masses. IP strategy needs to recognise this shift and move the focus to protecting the data that possesses the true value: the rubies, emeralds and sapphires that are the new data gemstones.
Reviewed by Dr Rose Hughes on Sunday, October 05, 2025
It is also worth pointing out that not all problems can be fixed with RL, SFT, etc., even when an apparently fantastic dataset exists. If there is no reward function/policy that can be constructed, if there is no signal in the training data, if right and wrong is subjective, and so on, then everyone will just end up wasting a lot of compute claiming x% better scores on their own arbitrary evals (with monster error bars) that are flawed from the get-go.
Data is important, but what is more important is identifying tasks in domains that can actually be solved by machine learning given a suitable model architecture that has the architectural features designed for that task. Find the architecture that can actually solve the task, then feed it gemstone data. Doing it the other way around is just going to lead to disappointment.
Excellent write-up Rose, but I disagree with your conclusion. The power of LLMs/neural networks etc. is their ability to analyse huge amounts of low-quality data to find the 'signal', the gemstones etc. It is important to realise they can essentially train themselves if you can give them some sort of definition of what you are looking for.
Responding to Anonymous of Monday, 6 October 2025 at 15:07:00 GMT+1, respectfully, we don't need the humans 'identifying tasks in domains that can actually be solved by machine learning given a suitable model architecture that has the architectural features designed for that task' (quote from the comment). That would mean the actual power of the neural network is not being used properly. I would say tentatively (without criticising) that the article does not properly appreciate the power of the maths in the neural network, which is incredibly good at finding any signal that is present. The neural network solves the problem of working out the system (i.e. finding the pattern/equation/algorithm) that can be the basis of using the signal in that data.
Hi Santa - your contribution is welcome as always! I would agree with respect to pre-training, but not in post-training. There is a view that pre-training is basically "solved", such that the focus of many of the major LLM labs at the moment has shifted to finding really good quality data for RLHF, RLVR (Reinforcement Learning from Verifiable Rewards) etc., as opposed to more pre-training with low-quality data. See, for example: https://prajnaaiwisdom.medium.com/llm-pre-training-vs-post-training-why-the-second-half-matters-more-than-you-think-6a9941a00421. For this, the data gemstones are needed to feed the models.
Santa, that view is somewhat naive. Do you really think the transformer in its original form just appeared out of thin air when applying some undefined "machine learning" to some arbitrary collection of different NN components such as MLPs, convs, activation functions etc.? Do you think the variants of the attention mechanism that have become commonplace in LLMs and other architectures arose out of magic by applying RL? Of course not. Each architecture has features that have a very clear purpose, and if you follow the story in the literature of the development of each architecture and its subcomponents, you can identify very clearly why certain features are designed the way they are. Absolute positional embeddings leading to relative positional embeddings leading to RoPE are a great and easy example of this.
Every architectural feature is very carefully designed to try to overcome a problem arising from the specific task the architecture is being used for.
The wrong architecture for the wrong task will not learn effectively. This has been proven time and time again. Architecture first. Data second.
To quote you: " I would say tentatively (without criticising) that the article does not properly appreciate the power of the maths in the neural network which is incredibly good at finding any signal that is present. The neural network solves the problem of working out the system (i.e. finding the pattern/equation/algorithm) that can be the basis of using the signal in that data. "
This is what most would call neural architecture search, which is a field in its own right that unfortunately does not produce architectures that are even remotely competitive in most fields with architectures designed for their underlying purpose.
Thank you both Rose and Anonymous. I am sure everything you have said is correct and I don't have the expertise to comment on it in any meaningful way. Thank you for engaging and correcting me, and for pointing out the present thinking and approaches in the field. However, to simply provide a perspective for the unrealistically optimistic, I believe that in time AI will show us how to get more from low-quality data and how to go about constructing the core machine learning system itself, with less and less human help. That is based on my understanding of the underlying maths of the neural networks, which again I am not expert in.