AI in the patent industry: Don't believe the hype. Believe the data.

Many in the IP profession remain deeply sceptical of AI. The received wisdom is that AI may be useful for checking for typos and simple deadline calculations, but that it cannot replace in-depth human reasoning about complex scientific and legal issues. The data, however, suggests otherwise.

How do we know LLMs are any good? 

Generative AI is an exceedingly competitive field, possibly one of the most competitive commercial arenas today. The foundational labs are in a race to produce the best models, which means that what counts as "the best model" matters to a great many people. The competitive nature of the field, combined with the widespread adoption of AI, has generated considerable interest in evaluating and comparing models on every possible metric. Consequently, there is a wealth of comparative data at our disposal.

The website Artificial Analysis is a good source of independent LLM evaluation data. As a casual perusal of Artificial Analysis shows, there are many possible ways to evaluate an LLM, including hallucination rate, speed, coding ability, ability to follow instructions, and mathematical reasoning. The MATH-500 benchmark, for example, is a collection of 500 maths problems spanning algebra, geometry, intermediate algebra, number theory, precalculus, and probability, requiring step-by-step solutions and precise mathematical reasoning (GPT-5 is the current leader, scoring 99.4%). Given the focus on coding in the LLM world, it is unsurprising that many of the benchmarks evaluate this type of ability. However, what we need the models to do in the IP profession is very different from coding or solving maths problems.

Evaluating AI for patent work 

The LLM evaluation that is most relevant to the patent profession, in this Kat's opinion, is the assessment of models' ability to perform long-context reasoning (LCR). If an LLM is to be of any use to the patent profession, it will need to understand and reason about long, complex documents. As luck would have it, there is an evaluation designed to test exactly this ability.

The current LCR evaluation for LLMs from Artificial Analysis consists of 100 questions relating to diverse document types, including academic papers, company financials, government consultations, legal documents, industry reports, and marketing materials. The documents average 100,000 tokens each (i.e. about 75,000 words, or 300 pages). The 23 legal documents in the test contributed the most tokens, with an average of 116,000 tokens each. According to the Artificial Analysis summary, the LCR eval requires "genuine reasoning" rather than simple data extraction: multi-step analysis to synthesize information from dispersed sections, an understanding of complex domain-specific content, and clear, unambiguous answers free from errors and hallucinations.
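For readers wondering where the "75,000 words, or 300 pages" figures come from, the conversion can be sketched with common rules of thumb: roughly 0.75 English words per token and roughly 250 words per page. These ratios are general approximations, not figures taken from Artificial Analysis:

```python
# Rough token-to-length conversion, using common rules of thumb
# (~0.75 words per token, ~250 words per page for English text).
# These ratios are approximations, not Artificial Analysis figures.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 250

def document_length(tokens: int) -> tuple[int, int]:
    """Return an approximate (word count, page count) for a token count."""
    words = int(tokens * WORDS_PER_TOKEN)
    pages = round(words / WORDS_PER_PAGE)
    return words, pages

words, pages = document_length(100_000)
print(words, pages)  # 75000 words, 300 pages
```

On these assumptions, the 116,000-token average for the legal documents in the test works out to roughly 87,000 words, or around 350 pages.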

According to Artificial Analysis, humans have traditionally outscored LLMs dramatically on LCR. Indeed, up until around 2024, even the best frontier models, such as ChatGPT, Claude and Gemini, achieved less than 50% accuracy in the LCR evaluation. Given this, it is entirely unsurprising that, if you tried to use an LLM for patent work in early 2024, it probably wasn't very good. In 2024, it was indeed a huge struggle to get LLMs to achieve good outcomes on complex tasks such as patent drafting, prosecution or prior art analysis without a highly sophisticated workflow, a great many separate prompting steps and complex coding loops, all of which required a lot of underlying programming and software engineering (this Kat knows, she tried). In the world of 2024, AI-wrapper software made a lot of sense. It was also at this time that many of the AI-wrapper companies for IP were founded. After all, in early 2024, we needed them.

The rate of change

According to the data, therefore, in 2024 LLMs were fairly bad at understanding and reasoning about long, complicated documents. However, in AI, things change, and they change fast, and the LCR evaluation data for the latest models tells us exactly how much things have changed. According to the independent assessment of Artificial Analysis, the best models currently available (ChatGPT 5, Claude Opus 4.6, Gemini Pro 3.1) score around 75% on the LCR test, a big jump from below 50%.

But 75% is still pretty far off 100%, I hear readers cry. The key piece of comparative information, however, is how well humans perform on this test. According to Artificial Analysis, human domain experts also struggle with it. Whilst human evaluation confirmed that each question could be answered correctly, expert humans typically scored only 40-60% on their first attempt. In other words, with average scores of 75%, the best models are now better at long-context reasoning than the average human domain expert (and in a fraction of the time).

The current ability of the frontier models on LCR tasks means that much of the programming and software scaffolding on which the AI-wrapper companies were originally built is no longer needed. The frontier models now just do this by default. Indeed, it is important to keep up to date with the capabilities of the underlying models, so as to avoid over-engineered solutions that actually prevent the models from performing well (IPKat). The skill of prompt engineering has become as much about what you don't need to say as what you do.

Another key message from the LCR evaluation data is that, for patent work, we need to be using the best models. Interestingly, on the LCR benchmark, Grok 4.20 languishes down at 58%, whilst DeepSeek v3.2 scores a respectable 65%. If you are using a free version of an LLM, or a "fast" non-reasoning model, it will be far worse, and probably no better than 50%. However, the better models are also far more expensive per token. If you are speaking to an AI-wrapper software company, "which model are you using?" is therefore one of the first questions you need to be asking.

The demise of the AI wrapper?

A previous IPKat post discussed the increasing redundancy of AI-wrapper software for IP (IPKat: Is AI software for IP just expensive wrapping paper?). These days there is not much difference between the output of such tools and the output of a foundational LLM such as ChatGPT, Gemini or Claude combined with prompt engineering by an experienced patent attorney. In many cases, the content of the wrapper's output is worse, as it will lack specific technical-field and jurisdictional expertise. However, a strong argument for using a wrapper was that wrappers do generally provide a user-friendly interface with, for example, the ability to combine prompting and track changes.

That all changed last weekend, with Anthropic's release of the Claude plug-in for Word. The Claude plug-in (which, interestingly, appears to be marketed at lawyers specifically) allows Claude users to prompt within Word, incorporating tracked-changes functionality. Word is, by all accounts, a horrible piece of clunky software to deal with, and it is notable that Microsoft themselves haven't yet worked out how to combine Copilot prompting and tracked changes in a usable way. Whilst tracked changes and version control combined with prompting have long been available for code, for text editing some people were even predicting a shift away from Word to markdown or LaTeX editors. Anthropic has, however, clearly recognised the importance of Word integration as a bottleneck for AI adoption, and thrown everything at providing its own plug-in. As with all AI use, the Claude plug-in uses Claude itself, and users therefore need to have the appropriate confidentiality provisions in place (IPKat).

Final thoughts

It is clear from the benchmark data that not only are LLMs now very good, but their abilities are also increasing very rapidly. In this Kat's view, the launch of the Claude Word plug-in removes one of the few remaining arguments for relying on AI-wrapper software instead of upskilling ourselves as attorneys to use AI effectively. Whilst the proud dinosaurs may be content to wait out the rise of AI until they can take early retirement, those of us who are enthusiastic about the future of the profession should view LLMs as an opportunity to learn something new. To do this, we need to be learning how to use AI to enhance our own expert capabilities, not relying on someone else's.

Acknowledgements: Thanks, as always, to Mr PatKat (Laurence Aitchison, Head of Reasoning at Mistral) for his invaluable AI-industry insights.

Reviewed by Dr Rose Hughes on Wednesday, April 15, 2026

8 comments:

  1. As to the burgeoning capabilities of LLMs, if one goes to the EPO Register in the G1/25 case, one finds a new (10 April 2026) 22-page third party observation coming from "Oliver Farnsworth". As far as I can determine, this is a pseudonym. But is this interested "third party" also an AI?

    Take a look at its 22-page work product and ask yourself: skilfully instructed, could an AI have written all that? I have no idea.

    But if it were an AI that wrote it, well then it is pretty impressive, don't you think?

    Otherwise, Oliver cites the IPKat in his learned submission so presumably will read this thread. Are you out there, Oliver? What do you say?

  2. Interesting data for sure, but there is also data about a topic which I think isn't currently being discussed enough: economics.

    If you listen to Ed Zitron or other critics of the "AI bubble" like him, you find out that OpenAI and Anthropic currently spend anywhere between 3 and 15 US dollars for every 1 US dollar they earn from subscription revenues. This means that their operations are being subsidized on an incredible scale by their investors.

    Which will have to stop at some point, because even very-well funded investors do not have infinite amounts of capital.

    Thus, eventually, OpenAI and Anthropic will have a problem: 1/ find a way to earn more revenue and/or bring costs down, or 2/ face bankruptcy.

    Except that 1/ could very well lead to 2/, because raising prices by 3x, 10x or 15x does not exactly help with customer retention, and selling a worse product for the same price does not either.

    Even worse, it appears that the newest models are more capable largely because they use more tokens. (Zitron quotes figures of "over a trillion tokens in a single week" for Anthropic's Claude Sonnet 4.5.) And token usage translates to cost. Cost which is largely hidden from the end user because, see above, AI providers are subsidized by their investors.

    I suspect that most AI providers will face the same problem sooner or later. When they do, how many will survive the economic reckoning? And after the dust has settled, how many AI tools will still be on the market, and at what price?

    I do not have the answer to either of these questions, but I think no one should assume that all currently available AI tools will still be on the market in a few years, and at the same price plus a little inflation adjustment.

  3. I think one of the biggest blockers in our profession is the assumption that AI must be able to do all of a patent attorney's job to be useful, and in my opinion that's the wrong benchmark. At this stage, AI is best understood as an augment. It accelerates reading, synthesis, first-pass analysis and drafting, whilst judgment, context and responsibility remain firmly human. Used that way, the question isn't whether AI can replace us, but whether we're willing to rethink how we work and where our real value lies.

  4. Oliver Farnsworth is the name of the patent attorney in The Man Who Fell to Earth, so, yes, it's a pseudonym.

  5. Private practice here. AI currently gives me efficiency with one hand, and takes away efficiency with the other.

    Specifically, every bit of time I save myself through using AI, more of my time is wasted by clients’ misuse of AI to communicate with me.

    How good state of the art AI becomes is a challenge for another day. The big challenge right now is dealing with ‘slop communication’, and ‘de-radicalising’ clients/enquiries because a free model has grossly misled them.

  6. I'm not hopeful that models are going to get better on patent-related reasoning tasks. "Better" is an ill-defined, non-verifiable moving target for us (and for legal advice in general). In all the benchmark examples given in the article, each test has a right and wrong answer that can be programmatically checked for correctness, or scored by another LLM acting as judge.

    As anyone who has received LLM-written client feedback on a draft or response proposal will tell you: even the latest Claude and Gemini models are awful at judging what is good and what is bad.

    So problem 1 is that a good patent benchmark is impossible to create; and even if it were possible, problem 2 is that there are no good judge models to reliably assess improvement or regression on such a benchmark.

    You can't make a model better at something you can't define, quantify or judge at scale.

    Replies
    1. "You can't make a model better at something you can't define, quantify or judge at scale." Someone should tell the PEB and EQE examiners...

  7. I'd enjoy watching an AI attempt to pass FD4

