\section{Methods}
\label{sec:arch}
We adapted the co-training algorithm ({\em DL-CoTrain}) of
\newcite{ref:collins99unlabeled} to learn a decision list (DL) from examples
of only one class. We first review DL-CoTrain and
then describe PMI-CoTrain, our mutual-information-based modification for
one-class problems.
\subsection{Multiple Class Co-training (DL-CoTrain)}
\label{subsec:dlcotrain}
For this algorithm, we represent an instance $\mathbf{x}$ by the set of its
binary features $\mathbf{x}=\{x_1,\cdots,x_k\}\subseteq \mathcal{X}$ where
$\mathcal{X}$ is the set of possible features; class labels are drawn from a
finite set $\mathcal{Y}$. Then a decision list is a function $d : \mathcal{X}
\times \mathcal{Y} \rightarrow [0,1]$ that maps a feature and a label to a
confidence value $d(x,y)$. In other words, $d$ represents a set of decision
rules $x\rightarrow y$ with weights $d(x,y)$. The decision list $d$ can then be
used to label an instance as follows:\[ \hat{y} = \arg \max_{x \in \mathbf{x}, y \in \mathcal{Y}} d(x,y) \]
That is, the instance gets the label for the decision rule in $d$ with the highest weight.
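As a concrete illustration, the labeling rule above can be sketched in Python as follows (the rules, features, and weights here are invented for the example, not taken from our experiments):

```python
def label_instance(d, features, labels):
    """Label an instance by the highest-weight matching rule in the
    decision list d, where d maps (feature, label) pairs to weights."""
    best_label, best_weight = None, float("-inf")
    for x in features:
        for y in labels:
            w = d.get((x, y), 0.0)
            if w > best_weight:
                best_label, best_weight = y, w
    return best_label

# Hypothetical rules: the spelling feature contains(Bush) strongly
# indicates the PERSON class.
d = {("contains(Bush)", "PERSON"): 0.95,
     ("lowercase-starts-with(b)", "ORGANIZATION"): 0.40}
features = {"contains(Bush)", "lowercase-starts-with(b)"}
label = label_instance(d, features, {"PERSON", "ORGANIZATION"})
```

Here the rule with weight $0.95$ wins, so the instance is labeled {\em PERSON}.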
The algorithm learns by inducing two decision lists, one for features based
on the \emph{content} of an entity mention (typically spelling features) and
the other for features based on the \emph{context} in which the entity mention
occurs. Content features might include {\em contains(Bush)} or
{\em lowercase-starts-with(b)}; context features might include {\em
lexical-context(``the president'')} for the instance {\em ``George Bush, the president
of the United States''}.
Starting from a seed set of content decision rules for each class, the entire
corpus is labeled; examples with no matching decision rule are left unlabeled.
Using the current set of labeled examples,
possible context features $x$ are induced where the weight for a candidate
context rule with that feature is:
\begin{equation}
d(x,y) = \frac{\mathrm{Count}(x,y) + \alpha}{\sum_{y' \in \mathcal{Y}} (\mathrm{Count}(x,y') + \alpha)}
\label{eqn:hxy}
\end{equation}
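A minimal sketch of this smoothed estimate, with invented counts for a single context feature observed across two classes:

```python
from collections import Counter

def rule_confidence(counts, x, y, labels, alpha=0.1):
    """Smoothed confidence of the rule x -> y: Count(x, y) plus a
    smoothing constant alpha, normalised over all candidate labels."""
    total = sum(counts[(x, y2)] + alpha for y2 in labels)
    return (counts[(x, y)] + alpha) / total

# Toy counts: the context feature "ctx" co-occurs 9 times with PERSON
# and once with ORG.
counts = Counter({("ctx", "PERSON"): 9, ("ctx", "ORG"): 1})
conf = rule_confidence(counts, "ctx", "PERSON", ["PERSON", "ORG"], alpha=0.1)
```

With these counts the rule {\em ctx} $\rightarrow$ {\em PERSON} gets confidence $9.1/10.2 \approx 0.89$.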
At each iteration, a maximum of $n=5$ rules with confidence greater than
$p_{\min}$ are added to the corresponding decision list for each class. Once
context rules have been induced, the corpus is relabeled using them to
induce new content rules. This process is repeated until a maximum of
$N_{\mathrm{max}}=2500$ rules are gathered. In this manner, co-training
utilizes the redundancy in natural language to produce a largely unsupervised
classifier.
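One selection step of this loop can be sketched as follows (the candidate confidences are invented; in the full algorithm they come from Equation \ref{eqn:hxy}):

```python
def select_rules(candidates, n=5, p_min=0.95):
    """One DL-CoTrain selection step: from scored candidate rules
    {(feature, label): confidence}, keep at most n rules per class whose
    confidence exceeds p_min."""
    chosen = {}
    labels = {y for (_, y) in candidates}
    for y in labels:
        scored = [((x, y2), c) for (x, y2), c in candidates.items()
                  if y2 == y and c > p_min]
        scored.sort(key=lambda item: -item[1])  # highest confidence first
        chosen.update(dict(scored[:n]))
    return chosen

# Hypothetical candidate context rules; "ctx-b" falls below p_min and
# is filtered out.
candidates = {("ctx-a", "PERSON"): 0.99, ("ctx-b", "PERSON"): 0.90,
              ("ctx-c", "ORG"): 0.97}
new_rules = select_rules(candidates, n=5, p_min=0.95)
```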
\subsection{Single Class Co-training (PMI-CoTrain)}
\label{subsec:pmi_co_train}
In practice, the DL-CoTrain method performs well when there are a fixed, small
number of class labels. However, for attribute extraction there are many
different attributes that are applicable to different entity types. Providing
seeds for each attribute would be impractical, especially since we want our
methods to discover attributes with minimal supervision.
Instead, we propose the use of relation-specific positive seeds to bootstrap
the co-training algorithm. For the DL-CoTrain algorithm with just one label,
the estimate in Equation (\ref{eqn:hxy}) is uniformly one. To alleviate this problem, we
split our feature set into two redundant views ({\it spelling} and {\it
context}), $\mathcal{X}_1$ and $\mathcal{X}_2$, and score features in the two
sets based on pointwise mutual information. The confidence estimation scheme is
similar to the one outlined in \cite{ref:espresso}.
\begin{equation}
Conf(x_1) = \frac{\sum_{x_2 \in \mathcal{X}_2} \frac{pmi(x_1,x_2)}{max_{pmi}} \cdot Conf(x_2)}{\lvert \mathcal{X}_2 \rvert}
\label{eqn:r-key-k}
\end{equation}
\begin{equation}
Conf(x_2) = \frac{\sum_{x_1 \in \mathcal{X}_1} \frac{pmi(x_1,x_2)}{max_{pmi}} \cdot Conf(x_1)}{\lvert \mathcal{X}_1 \rvert}
\label{eqn:r-val-v}
\end{equation}
where $Conf(x_1)$ and $Conf(x_2)$ are the confidences of features $x_1$ and
$x_2$, and $pmi(x_1,x_2)$ is the pointwise mutual information
(pmi) \cite{ref:cover-thomas} between $x_1$ and $x_2$. $max_{pmi}$ is
the maximum pmi over all pairs $x_1 \in \mathcal{X}_1$ and $x_2 \in
\mathcal{X}_2$. It is worth noting that $Conf(x_1)$ and $Conf(x_2)$ are
recursively defined; $Conf(x_1)=1.0$ for the manually supplied seed features.
The pointwise mutual information is then calculated for two features $x_1 \in
\mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$ as follows,
\begin{eqnarray*}
pmi(x_1,x_2) &=& DF \cdot \log\frac{P(x_1,x_2)}{P(x_1)\, P(x_2)}\\
&=& DF \cdot \log\frac{\mid x_1,x_2 \mid \, \mid *,* \mid}{\mid x_1,* \mid \, \mid *,x_2 \mid}
\label{eqn:pmi}
\end{eqnarray*}
where $\mid x_1, x_2 \mid$, $\mid x_1, * \mid$, $\mid *, x_2 \mid$ and $\mid *,
* \mid$ are the co-occurrence counts of $x_1$ with $x_2$, of $x_1$
with any $x'_2 \in \mathcal{X}_2$, of $x_2$ with any $x'_1 \in \mathcal{X}_1$,
and of any $x'_1 \in \mathcal{X}_1$ with any $x'_2 \in \mathcal{X}_2$, respectively. A
discount factor ($DF$), as suggested by \newcite{ref:pantel-ravi04}, may be
used to prevent low-count features from being ranked too highly. To get the score
of a particular feature pair $(x_1,x_2)$ we use,
\begin{equation}
Conf_{PT}(x_1,x_2) = Conf_{PT}(x_1) \cdot Conf_{PT}(x_2)
\label{eqn:pair-score}
\end{equation}
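The discounted pmi can be sketched directly from raw co-occurrence counts. The multiplicative form of $DF$ below is a common variant attributed to \newcite{ref:pantel-ravi04}; we state it here as an assumption rather than as the exact discount used in our implementation:

```python
import math

def discounted_pmi(c_xy, c_x, c_y, c_total):
    """Discounted pointwise mutual information from co-occurrence counts:
    c_xy = |x1, x2|, c_x = |x1, *|, c_y = |*, x2|, c_total = |*, *|.
    The discount factor DF shrinks the score of low-count pairs."""
    pmi = math.log((c_xy * c_total) / (c_x * c_y))
    df = (c_xy / (c_xy + 1.0)) * (min(c_x, c_y) / (min(c_x, c_y) + 1.0))
    return df * pmi

# Toy counts: the pair co-occurs 8 times out of 200 total co-occurrences.
score = discounted_pmi(8, 10, 20, 200)
```

Without the discount, a pair seen once with two rare features could outscore a pair seen many times; $DF$ dampens exactly those cases.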
The PMI-CoTrain method proceeds in a similar manner to the DL-CoTrain algorithm
by going back and forth between content and context features. The confidence
values in equations (\ref{eqn:r-key-k}) and (\ref{eqn:r-val-v}) are used in
place of the probability estimate in equation (\ref{eqn:hxy}). The method
allows us to use arbitrary features in the redundant views, unlike earlier
methods~\cite{ref:espresso} in which the entire context or spelling has to be
used. However, care must be taken that the generated features are neither
too specific (lower recall) nor too general (lower precision). Our experiments below
suggest that this is possible. In such a scheme, either a fixed
number of iterations is used or the iterations continue until the confidence
estimates drop below a certain threshold.
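One alternation of these confidence updates can be sketched as follows (the feature names and pmi values are illustrative; Equations \ref{eqn:r-key-k} and \ref{eqn:r-val-v} have the same form with the roles of the two views swapped):

```python
def update_confidences(pmi, other_conf, view, other_view, max_pmi):
    """One half-step of PMI-CoTrain: the confidence of a feature x1 is
    the average, over all features x2 in the other view, of the
    normalised pmi(x1, x2) weighted by Conf(x2)."""
    return {
        x1: sum(pmi.get((x1, x2), 0.0) / max_pmi * other_conf.get(x2, 0.0)
                for x2 in other_view) / len(other_view)
        for x1 in view
    }

# Toy setup: one seed spelling feature with confidence 1.0 induces
# confidences for two context features.
spelling = ["sp:seed"]
context = ["ctx:good", "ctx:rare"]
pmi = {("ctx:good", "sp:seed"): 2.0, ("ctx:rare", "sp:seed"): 0.5}
max_pmi = 2.0
ctx_conf = update_confidences(pmi, {"sp:seed": 1.0}, context, spelling, max_pmi)
```

The context feature with the stronger pmi association to the seed receives the higher confidence; a full iteration would then apply the symmetric update back to the spelling view.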
\subsection{Extraction Algorithm}
\label{subsec:algorithm}
Our extraction algorithm proceeds in two stages, both stages utilizing the PMI-CoTrain method described above:
\begin{enumerate}
\item $\mathit{Patterns} \Join \mathit{Tuples}$ extraction: First, a set of instances is labeled using the seeds, and the confidence of the extracted tuple features is set to $0.9999$. The PMI-CoTrain algorithm is run on the feature set, which is divided into pattern-based and tuple-based features. The confidence values of these features are updated using Equations \ref{eqn:r-key-k} and \ref{eqn:r-val-v}. The algorithm is run for a fixed number of iterations $N$, and finally the tuples and instances with the highest confidence scores are output.
% Alternatively, the algorithm described in \cite{ref:espresso} can be used.
\item $\mathit{Keys} \Join \mathit{Values}$ re-ranking (\KV-ranking): This stage takes as input the set of tuples extracted by stage 1 and outputs a list of tuples sorted by scores computed using equation (\ref{eqn:tpl-score}). Here the two feature sets utilize the keys and values of the $(entity, attribute)$ tuple, respectively. For example, ({\em IBM}, {\em employees}) is an instance of the relation ({\em Company}, {\em attribute}); {\em IBM} is the {\em key} and {\em employees} is the {\em value}. The confidence of a particular entity-attribute pair is modified from equation (\ref{eqn:pair-score}) to be,
\begin{eqnarray}
Conf_{KV}(x_1,x_2) = Conf_{PT}(x_1, x_2) \cdot \nonumber\\
Conf_{KV}(x_1) \cdot Conf_{KV}(x_2)
\label{eqn:tpl-score}
\end{eqnarray}
%% where
%% $I(k,v) = \left\{ \begin{array}{ll}
%% 1 & if (k,v) \in extracted\_tuples \\
%% -\infty & o.w.
%% \end{array}
%% \right. $
We use Equations \ref{eqn:r-key-k} and \ref{eqn:r-val-v} to estimate $Conf_{KV}(x_1)$ and $Conf_{KV}(x_2)$, respectively. Analogous to the $\mathit{Patterns} \Join \mathit{Tuples}$ view, the intuition here is to place increased confidence in a {\em value} that is strongly associated with many reliable {\em keys}. In the experiments reported in this paper, we do not use bootstrapping during \KV-ranking and instead use a fixed number of iterations. Since {\em KV}-ranking generates separate confidences for {\em keys} and {\em values}, ranked {\em key} and {\em value} lists are important by-products of {\em KV}-ranking.
\end{enumerate}
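The final scoring and sorting in stage 2 amounts to the following sketch (the confidence values are invented for illustration):

```python
def kv_score(conf_pt_pair, conf_kv_key, conf_kv_value):
    """Final tuple score: the stage-1 pair confidence multiplied by the
    KV-ranking confidences of the key and the value."""
    return conf_pt_pair * conf_kv_key * conf_kv_value

# Hypothetical tuples as (key, value, Conf_PT(pair), Conf_KV(key),
# Conf_KV(value)); a reliable value like "employees" outranks a
# spurious one.
tuples = [("IBM", "employees", 0.8, 0.9, 0.7),
          ("IBM", "logo", 0.8, 0.9, 0.2)]
ranked = sorted(tuples, key=lambda t: -kv_score(t[2], t[3], t[4]))
```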
The final output of the algorithm is the set of tuples extracted and re-ranked
using PMI-CoTrain.