Following a discussion with Robert, I have started working on a viewer
application intended to make viewing and judgment of corpora and topics
as easy as possible. The intention is to keep development as rapid as
possible. I'm building this with .NET (NHibernate / ASP.NET
Following are several remarks and a high-level description. I'm interested
in capturing some early feedback and ideas, but please note my intention
is to start with something functional first.
While FILEFORMATS.txt defines file structures, the viewer works against
a DB, so those structures will only be honored via export functions.
See attached image for a domain model.
A corpus DB entry points to a FS path (could also be remote via HTTP for
example). The viewer, in turn, will load the files one by one and the
judgment will be saved with the Corpus ID, Topic ID and a string
representation of the document filename. The first two are integers, and
the document ID is defined as a string, so document file-names can use a
base-24 ID representation for generated corpora (i.e. exporting from a
Unlike what was stated in FILEFORMATS.TXT, a corpus will not reside in a
The above approach may allow more than one person to judge the same
document for the same topic at once, which is bad since it could waste
the users' time (no need for double-judgment). I'll probably have to
resolve this by implementing a HiLo-like mechanism (or pooling), but I'm
leaving this for later.
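
To make the double-judgment problem concrete, here is a minimal sketch of
one way to avoid it. It is not HiLo and not the planned implementation:
it is a simple time-limited claim (lease) per document, held in memory
and written in Python purely for illustration. The names ClaimPool,
try_claim, release and lease_seconds are hypothetical; only the key shape
(Corpus ID, Topic ID, document ID string) comes from the description
above. A real version would presumably keep the claims in the DB and
update them inside a transaction.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (corpus_id, topic_id, document_id): two integers and a string.
JudgmentKey = Tuple[int, int, str]


@dataclass
class ClaimPool:
    """In-memory stand-in for a claims table (hypothetical, for illustration)."""

    lease_seconds: float = 300.0
    _claims: Dict[JudgmentKey, Tuple[str, float]] = field(default_factory=dict)

    def try_claim(self, key: JudgmentKey, user: str,
                  now: Optional[float] = None) -> bool:
        """Return True if `user` now holds the document, False if another user does."""
        now = time.time() if now is None else now
        holder = self._claims.get(key)
        if holder is not None:
            other_user, expires_at = holder
            if other_user != user and expires_at > now:
                return False  # still leased to another judge
        # Free, expired, or already ours: take (or renew) the lease.
        self._claims[key] = (user, now + self.lease_seconds)
        return True

    def release(self, key: JudgmentKey, user: str) -> None:
        """Drop the claim, but only if `user` is the current holder."""
        holder = self._claims.get(key)
        if holder is not None and holder[0] == user:
            del self._claims[key]
```

The lease expiry is there so that a document is not locked forever when
a user closes the browser without judging it.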
The web application will allow submitting new topics per language and
judging documents for a topic. The Judgment screen will show the topic
at the top, navigation on the left, and the document in the rest of the
screen. The user can choose "Relevant", "Irrelevant" or "Skip".
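
As a rough sketch of what the three choices could turn into when saved
(Python, for illustration only): the Choice, Judgment and record_choice
names are hypothetical, and treating "Skip" as "store nothing" is my
assumption here, not a decision that has been made.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Choice(Enum):
    RELEVANT = "Relevant"
    IRRELEVANT = "Irrelevant"
    SKIP = "Skip"


@dataclass(frozen=True)
class Judgment:
    corpus_id: int
    topic_id: int
    document_id: str  # string form of the document filename
    relevant: bool


def record_choice(store: Dict[Tuple[int, int, str], Judgment],
                  corpus_id: int, topic_id: int, document_id: str,
                  choice: Choice) -> Optional[Judgment]:
    """Save a judgment for Relevant/Irrelevant; Skip saves nothing (assumption)."""
    if choice is Choice.SKIP:
        return None
    judgment = Judgment(corpus_id, topic_id, document_id,
                        relevant=(choice is Choice.RELEVANT))
    store[(corpus_id, topic_id, document_id)] = judgment
    return judgment
```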
A user can filter by language, so they see only topics relevant to them.
Language filtering can be applied using a language string ("en-US") per
topic and corpus.
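
A minimal sketch of that filter, in Python for illustration. The Topic
fields and the topics_for_language helper are hypothetical. It assumes an
exact, case-insensitive match on the language string; whether "en" should
also match "en-US" is an open question this sketch does not answer.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Topic:
    topic_id: int
    language: str  # e.g. "en-US"
    title: str


def topics_for_language(topics: Iterable[Topic], language: str) -> List[Topic]:
    """Keep topics whose language string equals `language`, ignoring case."""
    wanted = language.strip().lower()
    return [t for t in topics if t.language.strip().lower() == wanted]
```

The same comparison would apply to corpora, since both carry a language
string.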
That's about it for now; looking forward to some feedback.