Following my post earlier this week about the benchmark report published by Blue Hill Research assessing the ROSS Intelligence legal research platform, I had several questions about the report, and many readers contacted me with questions of their own. The author of the report, David Houlihan, principal analyst at Blue Hill, kindly agreed to answer these questions.
The study assigned researchers to four groups: one using Boolean search on either Westlaw or LexisNexis, a second using natural language search on either Westlaw or LexisNexis, a third using ROSS and Boolean search, and a fourth using ROSS and natural language search. Why did none of the groups use ROSS alone?
Houlihan: Initially, we did plan to include a “ROSS alone” group, but cut it before starting the study. We did this for two primary reasons. One: the study was relatively modest and we wanted to keep our scope manageable. Focusing on one use case (ROSS combined with another tool) was one way to do that. Two: I don’t think an examination of “ROSS alone” is particularly valuable at this time. AI-enabled research tools are in early stages of technological maturity, adoption, and use. ROSS, for example, only provides options for specialized research areas (such as bankruptcy), which means assessing it as a replacement option for Westlaw or Lexis is premature. Instead, we focused our research on the use case with the currently viable value proposition. That said, I have no doubt that there will need to be examinations of the exclusive use of AI-enabled tools over time.
The report said that you used experienced legal researchers, but it also said that they had no experience in their assigned research platforms. How is it possible for an experienced legal researcher to have no experience in Westlaw or LexisNexis? Did you have Westlaw users assigned to Lexis, and vice versa?
Houlihan: You have it. Participants were not familiar with the particular platforms that they used. They were proficient in standard research methods and techniques, but we intentionally assigned them to unfamiliar tools. So, as you say, an experienced Westlaw user could be put on LexisNexis, but not Westlaw. The goal was to minimize any special advantage that a power user might have with a system and to approximate the experience of a new user. I think readers of the report should bear that in mind. I would expect different results if you looked at the performance of the tools with users at other levels of experience. That's another area that deserves additional investigation.
The research questions modeled real-world issues in federal bankruptcy law, but you chose researchers with minimal experience in that area of law. Why did you choose researchers with no familiarity with bankruptcy law?
Houlihan: In part, for reasons similar to why we assigned tools based on lack of familiarity. We were attempting to ascertain, as a baseline, the experience of an established practitioner who was tackling these particular types of research problems for the first time.
Moreover, introducing participants with bankruptcy experience and knowledge adds some insidious challenges. You cannot know whether your participants' existing knowledge is affecting the research process. You also need to decide what experience level you wish to use and how to ensure that all of your participants are operating at that level. Selecting participants who were unfamiliar with bankruptcy law eliminated those worries. That said, a comparison of the various tools at different levels of practitioner expertise is, again, a study I would like to see.
I would think that the bankruptcy libraries on ROSS, Westlaw and LexisNexis do not mirror each other. Given this, were researchers all working from the same data set or were they using whatever data was available on whatever platform?
Houlihan: Researchers were limited to searches of case law, but otherwise they were free to use the respective libraries of the tools as they found them. It strikes me as somewhat artificial to try to use identical data sets for a benchmark study like this. If we were conducting a pure technological bake-off of the search capabilities of the tools, I think that identical data sets would be the right choice. However, that’s not quite what Blue Hill is after. As a firm, we try to understand the potential business impact of a technology, based on what we can observe in real-world uses (or their close approximations). To get there, I would argue that you need to account for the inherent differences that users will encounter with the tools.
With regard to researchers’ confidence in their results, wouldn’t the use of multiple platforms always enhance confidence? In other words, if I get a result just using Lexis or get a result using both Lexis and ROSS, would the second situation provide more confidence in the result because of the confirmation of the results? And if so, would it matter if the second platform was ROSS or anything else?
Houlihan: I think that’s right, but we weren’t trying to show anything more profound. Among users of ROSS in combination with a traditional tool, we quite consistently saw higher confidence and satisfaction than among users of just one of those traditional tools.
Whether it is always true that the use of any two types of tools, such as Boolean and natural language search, would yield the same effect, I can’t say. We didn’t include that use case. As one of your readers rightfully pointed out, the omission is a limitation of the study. That is yet another area where more research is needed. I fear I am repeating myself too much, but the technology is new and the scope of what needs to be assessed is not trivial. It is certainly larger than what we could have hoped to cover with our one study.
For what it is worth: I wondered at the outset whether two tools would erode confidence. I still do. We tended to see fairly different sets of results returned from different tools. For example, a number of relevant cases consistently appeared in the top results of one tool but did not surface as readily in another. To my mind, that undermines confidence, since it encourages me to ask what else I missed. That reaction was not shared by our participants, however.
With respect to the groups assigned to use ROSS and another tool, did you measure how much (or how) they used one or the other?
Houlihan: We did, but we opted not to report on it. The relative use of one tool or another varied between researchers. As a group, we did observe that participants tended to rely more on the alternative tool for the initial questions and to increase their reliance on ROSS over the course of the study. I believe we make a note about it in the report. However, we did not find this to be a sufficiently strong or significant trend to warrant any deeper commentary without more study.
(This question comes from a comment to the original post.) It appears that the Westlaw and Lexis results are combined in the “natural language” category. That causes me to wonder if one or the other exceeded ROSS in its results and they were combined to obscure that.
Houlihan: We combined the tools because we never intended to compare Westlaw v. ROSS or Lexis v. ROSS. We were interested in how our use case compared to traditional technology types used in legal research. We used both Lexis and Westlaw within each assessment group to try to get a merged view of each technology type that wasn’t overly colored by the idiosyncrasies that the particular design of a tool might bring. In fact, we debated whether to mention that Westlaw or LexisNexis tools were used in the study at all. Ultimately, we identified them to make clear that we were comparing our use case to commonly used versions of those technology types. As for how individual tools performed, all I feel we can say reliably is that we did not observe any significant variation in outcomes for different tools of the same type.
A huge thanks to David Houlihan for taking the time to answer these. The full report can be downloaded from the ROSS Intelligence website.