Language Log

Supreme Court open infrastructure

December 4, 2009 @ 9:52 am · Filed by Mark Liberman under Computational linguistics

Yesterday and today, I'm at Washington University in St. Louis at a meeting on open infrastructure for studies of the U.S. Supreme Court, organized by Andrew Martin at the Center for Empirical Research in the Law. (That sentence sets some kind of local record for prepositional phrase density, but a couple of quick attempts to fix it made things worse. Just to start with, you've got CERL, which has two, and WUSL, which adds one more…)

Andrew is one of the principals of the Supreme Court Database Project. Other participants in the meeting include Jerry Goldman of Oyez, Wayne McIntosh of the Digital Docket project, Sarah Frug of the Cornell Legal Information Institute, Daniel Ho from Stanford, and Mike Bommarito and Dan Katz of the Computational Legal Studies blog.

That's all I have time for this morning, but I invite you to explore this area a bit by following some of those links. For example, recent posts at Computational Legal Studies will lead you to static and dynamic visualizations of the East Anglia Climate Research Unit leaked email network (how's that for a perfectly comprehensible 8-element complex nominal?), and to a post on Distance Measures for Dynamic Citation Networks, with an application to community structure in the early Supreme Court, and an associated paper:

Acyclic digraphs arise in many natural and artificial processes. Among the broader set, dynamic citation networks represent a substantively important form of acyclic digraphs. For example, the study of such networks includes the spread of ideas through academic citations, the spread of innovation through patent citations, and the development of precedent in common law systems. The specific dynamics that produce such acyclic digraphs not only differentiate them from other classes of graphs, but also provide guidance for the development of meaningful distance measures. In this article, we develop and apply our sink distance measure together with the single-linkage hierarchical clustering algorithm to both a two-dimensional directed preferential attachment model as well as empirical data drawn from the first quarter century of decisions of the United States Supreme Court. Despite applying the simplest combination of distance measures and clustering algorithms, analysis reveals that more accurate and more interpretable clusterings are produced by this scheme.

The Digital Docket's web site will lead you to its publications page. The Supreme Court Database Project offers some nifty online analytic possibilities, as well as the option to download all of its data in various convenient formats. And so on…

30 Comments

Dan Lufkin said,

December 4, 2009 @ 11:17 am

Lemme just quickly comment on the East Anglia Climate Research Unit leaked email network flap (C.P Snow's Two Cultures Dept.) — In one of the hacked e-mails, Phil Jones, head of the CRU, says, about a bad paper printed in a pseudo-journal, "… rid themselves of that troublesome editor."

I haven't found one single hit in the megawords that have been posted on climate blogs that picks up on the reference to Thomas á Becket. In a class I was teaching last week, one student in 16 (mostly engineers) got it.

Dunno how distance measures would handle this, but I find it depressing.
Vance Maverick said,

December 4, 2009 @ 11:59 am

I don't think academic citation graphs are really acyclic. Certainly during my own years in academe, it often happened that two papers were written in parallel (whether by the same group, or two groups in regular contact), and on publication, each had a reference to the other. (Email citation graphs, yes, barring time travel.)
Faldone said,

December 4, 2009 @ 12:09 pm

Did you count the missing preposition between I'm and Washington?
slobone said,

December 4, 2009 @ 12:42 pm

@Vance Maverick, Yeah, I was wondering about that. I thought it would be pretty funny if citation digraphs turned out to have cycles, but I guess there are legitimate ways it could happen…
Vance Maverick said,

December 4, 2009 @ 1:04 pm

The trick is for each paper to be accepted, so that a full citation is possible, before the other is finalized.
Philip TAYLOR said,

December 4, 2009 @ 1:27 pm

OK, but then what is the correct tense for mutually-recursive citations ? You can't write "x, y and z have shewn … [XYZ2009]", because they haven't (yet); they are about to; yet "x, y and z will show … [XYZ2009]" suggests quite remarkable precognition ! I suppose that "x, y and z show … [XYZ2009]" avoids all of the problems, but is it used ?
John Cowan said,

December 4, 2009 @ 1:33 pm

I presume you dropped one of your prepositions: for "I'm" read "I'm at".
Brett said,

December 4, 2009 @ 1:41 pm

In my field, theoretical physics, where almost all papers are posted online months in advance of their publication, recursive citations have become much more frequent in recent years (although they are still, in an absolute sense, quite uncommon). I have encountered it myself; I cited a paper that was available on arXiv.org (but not yet published) in a paper of my own. My paper was promptly published, and by the time the other paper finally was, it had been updated to include a citation to my work. Since all the citations are to pre-existing work (rather than upcoming or simultaneous work), there is no issue of tense. However, I must admit to feeling a weirdness to the whole thing. My own citation cycle was not a case of paper A relying on paper B relying on paper A… (we were merely acknowledging each other's work on different aspects of the same problem) but the possibility of that kind of infinite regress I find disconcerting.
Dan Holden said,

December 4, 2009 @ 2:41 pm

The acronym for WashU, at least locally and on the student radio station, is WUSTL.
Simon Cauchi said,

December 4, 2009 @ 3:30 pm

Please someone: explain the meaning of "acyclic digraphs".
Philip TAYLOR said,

December 4, 2009 @ 3:36 pm

Acyclic : contains no cycles (doesn't loop back on itself)
Digraph : portmanteau word, "directed graph", a graph (as in graph theory, as opposed to as in graph paper) in which the direction of travel matters.

Caveat emptor : I am not a mathematician.
Simon Cauchi said,

December 4, 2009 @ 3:36 pm

On second thoughts, please don't! Wikipedia will do.
Simon Cauchi said,

December 4, 2009 @ 3:43 pm

What a stupid portmanteau word! "Digraph" doesn't mean, or in my view shouldn't be used to mean, "directed graph". It means representing one sound by two characters, e.g. "ph" representing [f], or (in typography) a ligature combining two characters, e.g. ct or fi.
Simon Cauchi said,

December 4, 2009 @ 3:46 pm

Philip Taylor: you have good precedent for your "shewn". It was Bernard Shaw's choice of spelling, too.
Jem said,

December 4, 2009 @ 3:53 pm

@Simon: Well, with your bog-standard graph, you've got a set of vertices and a set of edges, with each edge touching two vertices, but we don't think of the edges as going in any particular direction; an edge between A and B is also an edge between B and A.

A digraph (short for directed graph) is like a graph, but the edges are only one-way connections. So if you have an edge from A to B, you might or might not also have a separate edge from B to A. These are handy for modeling things like citation networks, where the papers are the vertices and an edge from paper X to paper Y is a citation of paper X by paper Y.

A cycle in a digraph is a sequence of edges that leads back to its starting point, like A->B->A or A->B->C->D->A. An acyclic digraph is one that contains no cycles.

There are interesting facts that are true about acyclic digraphs that aren't true about digraphs in general; for instance, if (and only if) a digraph is acyclic, we can assign each vertex a number in such a way that all the edges lead from a smaller number to a larger.

Because of this property, citation networks are usually modeled as being acyclic digraphs (imagine assigning each paper a number representing its publication time and apply the fact mentioned in the last paragraph). As the comments have mentioned, it's not quite true that real-world citation networks are acyclic, because the concept of "publication time" can be a bit fuzzy.
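The numbering property described above can be sketched in a few lines of Python. This is only an illustration, not anything from the paper under discussion: the function name `topological_numbering` and the edge-list input format are assumptions made for the example, and edges are taken to run from the cited paper to the citing paper, as in the explanation above. It uses the standard approach of repeatedly numbering a vertex that has no unnumbered predecessors (Kahn's algorithm); if some vertices can never be numbered, the digraph has a cycle.

```python
from collections import defaultdict, deque

def topological_numbering(edges):
    """Return {vertex: number} so that every edge leads from a smaller
    number to a larger one, or None if the digraph contains a cycle.

    `edges` is a list of (source, target) pairs (a hypothetical input format).
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    vertices = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        vertices.update((src, dst))

    # Start with the vertices that nothing points to.
    ready = deque(sorted(v for v in vertices if indegree[v] == 0))
    numbering = {}
    while ready:
        v = ready.popleft()
        numbering[v] = len(numbering)
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)

    # Vertices on a cycle never reach indegree 0, so they stay unnumbered.
    return numbering if len(numbering) == len(vertices) else None
```

With the edges A->B, B->C and A->C this yields the numbering A=0, B=1, C=2; with A->B and B->A it returns None, which is the "if and only if" in action.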
Jem said,

December 4, 2009 @ 3:55 pm

@Simon: I didn't see your "second thoughts" until I typed that and hit submit to reload…I don't mind, though! The question/answer interaction on an Internet discussion forum is an enjoyable social interaction, even if it isn't necessary or useful in any relevant sense.
Simon Spero said,

December 4, 2009 @ 4:17 pm

This article raises an interesting question. If there is an area of study devoted to investigating the process of science through empirical means, and teams from outside that area attempt to perform studies of their own disciplines, might the latter group of scientists study and cite potentially relevant literature, or could there be potential duplications?

It's too late for breakfast experiments, so I can't make a name pun, but: Egghe and Rousseau (2002) studied "directed, acyclic graphs", and introduced "head and tail order relations and stud[ied] some of their properties". They generalised forms of the Jaccard similarity measure.

One measure of similarity they present is the ratio of the number of shared indirect citations to the total number of distinct indirect citations.
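A rough sketch of that ratio, for readers who want to see it concretely. This is not claimed to be Egghe and Rousseau's exact definition, nor the sink distance of the paper cited in the post; it simply takes "indirect citations" to mean everything reachable by following citations from a paper. The names `reachable` and `shared_citation_ratio`, and the dictionary format mapping each paper to the papers it cites, are assumptions made for the example.

```python
def reachable(cites, start):
    """All papers reachable from `start` by following citations.

    `cites` maps a paper to the list of papers it cites (hypothetical format).
    """
    seen, stack = set(), [start]
    while stack:
        for cited in cites.get(stack.pop(), ()):
            if cited not in seen:
                seen.add(cited)
                stack.append(cited)
    return seen

def shared_citation_ratio(cites, a, b):
    """Shared (direct or indirect) citations divided by distinct ones."""
    ra, rb = reachable(cites, a), reachable(cites, b)
    union = ra | rb
    # Two papers that cite nothing share nothing; report 0.0 by convention.
    return len(ra & rb) / len(union) if union else 0.0

# Hypothetical toy data: P cites X and Y, Q cites Y and Z, Y cites W.
cites = {"P": ["X", "Y"], "Q": ["Y", "Z"], "Y": ["W"]}
ratio = shared_citation_ratio(cites, "P", "Q")  # 2 shared out of 4 distinct
```

Here P reaches {X, Y, W} and Q reaches {Y, Z, W}, so the ratio is 2/4.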

Bommarito, Katz, Zelner & Fowler (2009), cited in the original post, propose a distance metric that "bear[s] some similarity to the Jaccard similarity measure, as they involve intersections in the numerator and unions in the denominator", but unlike "the standard Jaccard similarity index" takes into account reachable nodes at further distances than those immediately adjacent.

As a professor at Penn, would you call this Garfield minus Garfield?

End of rant
——-

Egghe, Leo and Ronald Rousseau (2002). “Co-citation, bibliographic coupling and a characterization of lattice citation networks”. In: Scientometrics 55.3 (Nov. 2002). Pp. 349–361. URL: http://dx.doi.org/10.1023/A:1020458612014.

Abstract:
In this article we study directed, acyclic graphs. We introduce the head and tail order relations and study some of their properties. Recalling the notions of generalized bibliographic coupling and generalized co-citation, and introducing a new property, called the l – property, we come to a characterization of lattices. As document citation networks are concrete realizations of directed acyclic graphs all our results are directly applicable to citation analysis.
Simon Cauchi said,

December 4, 2009 @ 5:02 pm

@Jem: Good to read your explanation anyway. Thanks.
Faldone said,

December 4, 2009 @ 5:22 pm

John Cowan: I presume you dropped one of your prepositions: for "I'm" read "I'm at".

I think he has a FIFO preposition stack that got filled up; the first one popped off into the preposition bucket.
Peter Taylor said,

December 4, 2009 @ 5:36 pm

the East Anglia Climate Research Unit leaked email network (how's that for a perfectly comprehensible 8-element complex nominal?)

Not quite there. I found it ambiguous (a network of emails seems the natural parse, but what would that mean?), and would prefer to make it 9-element: the East Anglia Climate Research Unit leaked email participant network.
George Amis said,

December 4, 2009 @ 6:22 pm

@Dan Lufkin, from the Dept. of Trivial Corrections

If you're not going to call him simply Thomas Becket, it's Thomas à Becket, not Thomas á Becket.
Spectre-7 said,

December 4, 2009 @ 6:58 pm

Caveat emptor : I am not a mathematician.

Funny. I always thought that meant buyer beware. ;)
John Cowan said,

December 4, 2009 @ 8:32 pm

I call him Thomas Bouquet myself.
Charles Belov said,

December 5, 2009 @ 4:05 am

Darn, fixed this very minute. There goes my theory "I'm Washington University" was an overblown case of metonymy. :P
Nathan Myers said,

December 5, 2009 @ 5:28 am

There's a trick for dealing with almost-acyclic graphs. (The "di-" is redundant when you say "acyclic", because it means nothing except in the context of directed graphs.) You bundle up all the nodes in a cycle and treat them, as much as possible, as one node. It's not very pleasant to code around, but it's better than giving up all the nice acyclic-graph properties.

This use of "trick" is the same as in the CRU e-mails, and is equally innocent.
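The bundling trick in this comment amounts to collapsing each strongly connected component into a single node. A minimal sketch follows, assuming an edge-list input with string vertex names; the function name `condense` is made up for the example, and the method used is the textbook two-pass approach (Kosaraju's algorithm), which is one of several ways to do it.

```python
from collections import defaultdict

def condense(edges):
    """Bundle every cycle into one node (its strongly connected component).

    Returns (component_of, dag_edges): component_of maps each vertex to the
    frozenset of vertices bundled with it; dag_edges holds the edges that
    run between distinct bundles.
    """
    forward, backward = defaultdict(list), defaultdict(list)
    vertices = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        vertices.update((src, dst))

    # Pass 1: record vertices in order of depth-first completion.
    finished, seen = [], set()
    for root in sorted(vertices):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(forward[root]))]
        while stack:
            v, neighbours = stack[-1]
            for w in neighbours:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(forward[w])))
                    break
            else:  # no unvisited neighbours left: v is finished
                finished.append(v)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    component_of = {}
    for root in reversed(finished):
        if root in component_of:
            continue
        members, stack, assigned = [], [root], {root}
        while stack:
            v = stack.pop()
            members.append(v)
            for w in backward[v]:
                if w not in component_of and w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        bundle = frozenset(members)
        for v in members:
            component_of[v] = bundle

    dag_edges = {(component_of[s], component_of[d])
                 for s, d in edges if component_of[s] != component_of[d]}
    return component_of, dag_edges
```

For the edges A->B, B->A, B->C, the two mutually citing papers A and B end up in one bundle, C in another, and a single edge runs from the {A, B} bundle to {C}. The graph of bundles is always acyclic.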
Dan Lufkin said,

December 5, 2009 @ 11:38 am

The subject of digraphs leads off to such entertaining topics as non-transitive dice and rock-paper-scissors, suitable for whiling away a snowy Saturday.

Sorry about old Tom Becket — must have had my keyboard skew-whiff. I'm acutely aware of a grave mistake, or words to that effect.
Garrett Wollman said,

December 5, 2009 @ 4:49 pm

The other way of dealing with a not-quite-DAG is to remove edges until all cycles disappear. In the case of citation graphs, for example, one can remove edges where the source node was published before the target node. It's been a long time since I studied graph theory so I don't recall what the complexity of the minimal edge-removal problem is.
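A sketch of the date-based pruning suggested here, under some loudly stated assumptions: edges are (citing, cited) pairs, so the source is the citing paper; each paper has a single publication date in a lookup table; and ties are dropped too, since two papers with the same date citing each other would otherwise leave a cycle. The function name `drop_citations_of_later_work` and the sample data are invented for the example.

```python
from datetime import date

def drop_citations_of_later_work(citations, published):
    """Keep only citations whose citing paper appeared strictly after the
    cited paper. `citations` is a list of (citing, cited) pairs and
    `published` maps each paper to its publication date (both hypothetical).
    """
    return [(citing, cited) for citing, cited in citations
            if published[citing] > published[cited]]

# Hypothetical data: A and B cite each other; B appeared later.
published = {"A": date(2009, 3, 1), "B": date(2009, 9, 1), "C": date(2008, 1, 1)}
citations = [("A", "B"), ("B", "A"), ("A", "C")]
kept = drop_citations_of_later_work(citations, published)
```

Every surviving edge points strictly backwards in time, so no cycle can remain. This is a heuristic rather than a minimal edge removal: finding the smallest set of edges whose removal breaks all cycles is the minimum feedback arc set problem, which is NP-hard in general.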
Kenny Easwaran said,

December 7, 2009 @ 7:01 pm

I was momentarily confused by "acyclic digraph" because the terminology I've always heard was "directed acyclic graphs" or "DAGs". At least, that's the terminology the causal Bayes net people use. "Digraph" just always suggests to me that you might have arrows going both ways between a given pair, which contradicts the "acyclic" part.

(Also, whoever said that "acyclic" only makes sense for digraphs is wrong. Of course, we don't normally use that word for non-directed graphs because we can use the word "tree" for each connected component of such a graph, and "forest" for the whole thing.)
Robert Richards said,

December 11, 2009 @ 1:38 pm

Professor Liberman: Thanks very much for publicizing this meeting and this very interesting new project. Would you be willing to provide more details about the discussion that took place during the meeting, and any future steps or plans that participants committed to during the meeting? I'd like to write a blog post at Legal Informatics Blog providing some details about the meeting and identifying future steps or plans that project participants have committed to pursue. I think there is considerable interest about this project among a wide range of developers and scholars in several disciplines. Thanks for considering this.
Supreme Court Open Infrastructure Project « Legal Informatics Blog said,

December 13, 2009 @ 7:35 pm

[…] A meeting — attended by personnel from Northwestern University's Oyez Project, the University of Maryland's Digital Docket Project, the Computational Legal Studies blog, Cornell's Legal Information Institute, Stanford Law School, the University of Pennsylvania's Linguistic Data Consortium, and Washington University’s Center for Empirical Research in the Law (CERL) — to discuss plans for the project took place on 3-4 December 2009, at CERL. Reports on the meeting are available from Daniel Martin Katz and Mark Liberman. […]
