Release Notes
PAN-11 AUTHORSHIP TRACK CORPUS
This is the preliminary release corpus for the PAN-11 authorship
attribution track. The track comprises seven tests drawn from five
different training sets. The training sets are:
Name       Number of Authors   Number of Documents
--------------------------------------------------
Large             72                  9337
Small             26                  3001
Verify1            1                    42
Verify2            1                    55
Verify3            1                    47
For each of the Large and Small training sets there are two tests: one
containing only authors from the training set, and one also containing
around 20 additional out-of-training authors. All of the verification
test sets inherently include out-of-training authors. In this
preliminary release, you are given example test sets, called "Valid"
(for validation) sets; wherever a test set contains out-of-training
authors, its name ends in a "+". For the files provided in this
package, the statistics are:
Name           Number of Authors   Number of Documents
------------------------------------------------------
LargeValid            66                 1298
LargeValid+           86                 1440
SmallValid            23                  518
SmallValid+           43                  601
Verify1Valid+         24                  104
Verify2Valid+         21                   95
Verify3Valid+         23                  100
For each of these, a testing file (without author IDs) is provided, as
well as a ground truth file containing the actual author IDs for the
texts, for validation purposes.
Note that apart from name redaction (described below), the texts are
intended to reflect a natural task environment, so some texts, in both
the training and the testing sets, are not in English or are
automatically generated. You need not give an answer for every test
document; abstaining may reduce your recall but may increase your
precision.
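The recall/precision tradeoff can be made concrete with a small sketch,
assuming precision is computed over answered documents and recall over all
test documents (the official PAN-11 scoring may differ in detail):

```python
def precision_recall(answers, truth):
    """Compute abstention-aware precision and recall.

    answers: {document: predicted_author_id}; documents omitted from
             this dict count as abstentions.
    truth:   {document: true_author_id} for every test document.
    """
    correct = sum(1 for doc, who in answers.items() if truth.get(doc) == who)
    precision = correct / len(answers) if answers else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Answering 2 of 4 documents, both correctly: perfect precision, half recall.
truth = {"d1": "A", "d2": "B", "d3": "C", "d4": "D"}
print(precision_recall({"d1": "A", "d2": "B"}, truth))  # (1.0, 0.5)
```

Under these definitions, every abstention lowers the recall ceiling, while
precision is charged only for the answers actually given.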
Each of the files is in an XML format, with similar schemas, as
follows. The training files look like:

<training>
 <text file="FILENAME">
  <author id="AUTHORID"/>
  <body>
TEXT OF THE MESSAGE
  </body>
 </text>
 ...
</training>

Testing files look like:

<testing>
 <text file="FILENAME">
  <body>
TEXT OF THE MESSAGE
  </body>
 </text>
 ...
</testing>

And the ground truth files look like:

<results>
 <text file="FILENAME">
  <author id="AUTHORID"/>
 </text>
 ...
</results>
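For working with the ground truth format, here is a minimal read/write
sketch in Python; the element and attribute names used (`results`, `text`,
`file`, `author`, `id`) are assumptions modeled on this README and may
differ from the released schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical ground-truth/results snippet; actual tag names may differ.
RESULTS = """<results>
 <text file="msg001.txt">
  <author id="A07"/>
 </text>
 <text file="msg002.txt">
  <author id="A12"/>
 </text>
</results>"""

def read_ground_truth(xml_string):
    """Map each test filename to its author ID."""
    root = ET.fromstring(xml_string)
    return {t.get("file"): t.find("author").get("id")
            for t in root.findall("text")}

def write_results(assignments):
    """Serialize {filename: author_id} back into the same XML shape."""
    root = ET.Element("results")
    for fname, author in sorted(assignments.items()):
        text = ET.SubElement(root, "text", file=fname)
        ET.SubElement(text, "author", id=author)
    return ET.tostring(root, encoding="unicode")

truth = read_ground_truth(RESULTS)
```

Writing submissions through the same serializer that round-trips the ground
truth files is an easy way to keep your output format-compatible.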
Submitted results must be in the ground truth file format for
evaluation.
Most personal names and email addresses have been (automatically)
redacted and replaced (on a token-by-token basis) by <NAME/> and
<EMAIL/> tags, respectively. This redaction is admittedly imperfect,
but we do not recommend relying on its imperfections. Other than this
redaction, each text is typographically identical to the original
electronic text, so you can, in principle, rely on line length,
punctuation, and the like.
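Because the redaction is token-by-token, feature extraction should treat
each redaction tag as a single token rather than splitting it into
punctuation. A minimal tokenizer sketch; the tag spellings `<NAME/>` and
`<EMAIL/>` are assumptions based on this README:

```python
import re

# Match redaction tags first so they survive as single tokens,
# then words, then any other non-space character.
TOKEN = re.compile(r"<NAME/>|<EMAIL/>|\w+|[^\w\s]")

def tokenize(text):
    """Split text into words, punctuation, and intact redaction tags."""
    return TOKEN.findall(text)

tokenize("Dear <NAME/>, write to <EMAIL/>!")
# → ['Dear', '<NAME/>', ',', 'write', 'to', '<EMAIL/>', '!']
```

Keeping the tags intact also lets you count redacted names and addresses
as ordinary vocabulary items in frequency-based features.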
Please contact Shlomo Argamon via our internal mailing list pan@webis.de
or via our public mailing list pan-workshop-series@googlegroups.com if you
have any questions.