\documentclass[titlepage,11pt]{amsart}
\oddsidemargin 0.0in
\evensidemargin 0.0in
\textwidth 6.5in
\usepackage{lettrine}
\usepackage{listings}
\usepackage{cite}
\usepackage{url}
%\usepackage[T1]{fontenc}
\usepackage[pdftex]{graphicx}
\graphicspath{{./images/}}
\DeclareGraphicsExtensions{.png,.jpg}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

% Handy macros.
\newcommand{\PR}{\emph{PackRat} }
\newcommand{\MF}{microformat}
\newcommand{\MFs}{microformats}
\newcommand{\Op}{Operator}
\newcommand{\JS}{JavaScript}
\newcommand{\Ex}{Exhibit}
\newcommand{\Exs}{Exhibits}

% No Javascript support built into listings...
\lstdefinelanguage{JavaScript}
  {morekeywords={var, null},
    sensitive=true,
    morecomment=[l]{//},
    morecomment=[s]{/*}{*/},
    morestring=[b]",
    morestring=[b]',
  }

\begin{document}
\lstset{language=JavaScript,
        basicstyle=\footnotesize,    % print whole listing small
        stringstyle=\slshape,     % slanted type for strings
        showstringspaces=false,    % no special string spaces
        columns=fullflexible,
        aboveskip=0.1in
       }
\title{PackRat: Adding Structure to your Bookmarks}
\author{Robin~Stewart, Chris~Murphy, and Andrew~Correa}
\date{\today}

\begin{abstract}
This paper presents the design rationale and implementation of a new tool, \emph{PackRat}, for collecting and reviewing structured microformat data from the Web.  \PR acts as a structured bookmarking tool, storing (at the user's request) any identified microformatted data in a SQLite database, along with the webpage URL and the time of the visit.  This data can later be reviewed through a faceted browsing interface and exported for further use.  \PR adapts automatically to new microformat definitions, and could scale to many thousands of microformat items if an appropriately fast interface to the data store were developed.
\end{abstract}

\vskip 2in
\begin{figure}[!h] % Here because it puts it in a good spot.
\centering
\includegraphics[width=4in]{packrat_logo}
%\label{fig:logo}
%\caption{Neotoma Cinerea}
\end{figure}
\vskip 1in

\maketitle

\section{Introduction}
\lettrine[lines=2, lraise=0, loversize=0.5]%
{M}{ost} information on the current World Wide Web was published with humans as the only direct end-user in mind.  This implicit design assumption underlies even major websites such as eBay, Amazon, and Google, which publish the majority of their webpages in a form that is only meaningful to human viewers.  Some companies, such as Google and Yahoo, have developed APIs to allow programmers access to online structured data, but for the average end-user most of the data is presented in a form not easily interpreted by the web browser or the user's computer.  Work done by the W3C's Semantic Web Interest Group\cite{swig} attempts to address this problem by introducing a standard format for describing data, called the {\em Resource Description Framework} (RDF).  However, the adoption of RDF on the Internet has been slow, due in part to the complexity of RDF and to the requirement that web authors publish their data in a completely separate format alongside existing human-readable web pages.  As a result, developing a website with semantically marked machine-readable data has so far been seen by many web developers as too much work for too little gain.

One effort to tip the balance in favor of semantically meaningful webpages is the work of the microformat community \cite{microformats}.  Microformats provide an unobtrusive and simple way to give semantic `hints' about structured information contained within a webpage, such as basic contact information, citations, or events.  Standard data interchange formats already exist for much of this data, such as vCard for contact information and iCalendar for events.  While many of these formats are easier to create than RDF, they are not easily integrated into existing web pages and still require web developers to duplicate their content and effort.  Microformats instead mark up regular (X)HTML web pages in a standardized way that indicates which types of data are being presented.  Many of the microformat standards are based on existing formats such as vCard, iCalendar, and BibTeX, so that they can be easily converted to their traditional counterparts for use in existing software.  The simplicity with which microformats can be used has led to their deployment on a wide variety of websites, ranging from the calendar for the 2008 meeting of the National Asphalt Pavement Association\cite{asphalt}, to Whitepages.com's directory listings\cite{whitepages}, to reviews on Golf Digest's ``Course Finder''\cite{golfdigest}.

\begin{figure}[!b] % Here because it puts it in a good spot.
\centering
\includegraphics[width=6.5in]{deployed}
\caption{Real microformat data displayed using the Firefox Operator extension}
\label{fig:deployed}
\end{figure}

As semantically marked-up (X)HTML in the form of microformats begins to proliferate, tools have been developed to work with this data and interact with it on individual webpages.  \PR builds on one such tool to provide users with the ability to collect bits of structured information as they browse the World Wide Web, use the data semantics to intelligently filter and browse the data after collection, and export the data in a variety of formats.

\section{Background}
The simplest, standard way to implement a microformat is to assign pre-defined, semantically meaningful ``class'' attributes to (X)HTML containers (such as the SPAN or DIV tag) around each piece of information.  For example, in the ``hCard'' specification (a microformat version of vCard), the formatted name of an individual is identified by including ``fn'' in the class attribute:

{\footnotesize\begin{verbatim}
<span class="fn">Joe Schmo</span>
\end{verbatim}}

These individual bits of information can then be grouped in a higher level container, and identified as the digital representation of the contact information for a particular individual or organization.  Since (X)HTML container tags are already widely used during webpage styling, semantically marking the data in this way typically requires little additional work for the web developer.  A number of additional tricks can be employed to generate simpler (X)HTML; one example of a legal (though brief) hCard instance is the following:


{\footnotesize\begin{verbatim}
<div class="vcard">
 <a class="url fn" href="http://schmo.com/">
   Joe Schmo
 </a>
 <abbr class="email" title="joe@schmo.net">
   Email Joe!
 </abbr>
</div>
\end{verbatim}}


A Firefox extension called \Op{}\cite{operator} has been designed to identify and extract this microformatted data from arbitrary web pages.  If a web page contains the above markup, \Op{} extracts it into a \JS{} object and displays it in a menu from which actions on the microformat can be taken, such as adding the individual to a contact list or composing an email.  However, once the user browses to a new web page, \Op{} discards all information about the microformats on previously visited webpages.
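
As a rough illustration of this extraction step, the sketch below flattens class-annotated markup into a \JS{} object.  It is not Operator's code: it walks a simplified element tree made of plain objects (the field names \texttt{classes}, \texttt{attrs}, \texttt{text}, and \texttt{children} are our own assumptions) rather than the browser DOM, and the function \texttt{extractProperties} is hypothetical.

\begin{lstlisting}[frame=tb]
// Hypothetical, simplified element tree for the hCard markup shown
// above.  A real extension would walk the browser DOM instead.
var card = {
  classes: ["vcard"], attrs: {}, text: "", children: [
    { classes: ["url", "fn"], attrs: { href: "http://schmo.com/" },
      text: "Joe Schmo", children: [] },
    { classes: ["email"], attrs: { title: "joe@schmo.net" },
      text: "Email Joe!", children: [] }
  ]
};

// Collect one value per recognized class name found below `node`.
// Assumed value rules: "url" reads href; a title attribute, when
// present, overrides the visible text.
function extractProperties(node, names, out) {
  out = out || {};
  node.children.forEach(function (child) {
    child.classes.forEach(function (cls) {
      if (names.indexOf(cls) === -1) return;
      if (cls === "url") out[cls] = child.attrs.href;
      else out[cls] = child.attrs.title || child.text;
    });
    extractProperties(child, names, out);
  });
  return out;
}

var joe = extractProperties(card, ["fn", "url", "email"]);
// joe is { url: "http://schmo.com/", fn: "Joe Schmo",
//          email: "joe@schmo.net" }
\end{lstlisting}

The resulting object is the kind of per-instance record that the rest of this paper assumes as input.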


\PR gives Firefox the ability to:
\begin{itemize}
\item Archive microformatted data in a scalable, relational database
\item Search and browse the stored data after collection using a faceted data representation
\end{itemize}

\PR is implemented as an add-on script for the \Op{} extension.  It dynamically creates a SQLite database on the local hard drive, which it accesses via the Mozilla Storage API\cite{mozStorage}.  In the following sections, we describe the microformat and database schemas, the collection process, and the export and retrieval process in more detail.  We then present evaluation results, discuss the current system, and suggest future work.

\section{Schema Definition and Collection}

\PR is designed to be highly extensible, and makes use of Operator's existing
microformat definitions to create appropriate database schemas without any additional
user involvement.  Microformats' tight integration with (X)HTML imposes a
number of restrictions on the data, such as requiring that the data be
arranged hierarchically, in a manner closer to that of IMS-era hierarchical
databases than to that of current relational databases.  \PR converts this
hierarchical data into tuples for storage in a relational database by following
a simple set of rules.  In addition, \Op{} reports any microformat that appears
as a property of another microformat instance as a top-level microformat in its
own right, which results in extensive duplication in Operator's \JS{}
representation of microformatted data.  \PR identifies this duplication and
attempts to collapse any duplicate microformat instances within a page
into a single microformat instance in the relational database.
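
One plausible way to detect such duplicates is to compare instances by content.  The sketch below is an illustration under that assumption, not a description of the actual \PR code; the helper names \texttt{canonical} and \texttt{collapseDuplicates} are hypothetical.

\begin{lstlisting}[frame=tb]
// Build a canonical string for a value: object keys are sorted so
// that two instances with the same content yield the same string.
function canonical(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonical).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    return "{" + Object.keys(value).sort().map(function (key) {
      return JSON.stringify(key) + ":" + canonical(value[key]);
    }).join(",") + "}";
  }
  return JSON.stringify(value);
}

// Keep the first occurrence of each distinct instance on a page.
function collapseDuplicates(instances) {
  var seen = {};
  return instances.filter(function (instance) {
    var key = canonical(instance);
    if (seen.hasOwnProperty(key)) return false;
    seen[key] = true;
    return true;
  });
}

// An adr nested in an hCard is also reported at the top level.
var pageAdrs = [
  { locality: "San Francisco", region: "California" },
  { region: "California", locality: "San Francisco" }
];
var unique = collapseDuplicates(pageAdrs);  // one instance remains
\end{lstlisting}

Sorting the keys makes the comparison independent of the order in which properties were encountered on the page.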

\subsection{Operator Schemas}
\Op{} defines a schema for each microformat that it can recognize and work with.
These schemas are simple \JS{} objects that define the properties each microformat
instance can have, whether each property can be multi-valued, and
the datatype of its values (a single typed or untyped value, a
collection of subproperties, or a microformat).  The schema is quite informal
compared to XML Schema or RDF Schema.  It is designed to be easy
to create and understand for simple cases, while remaining fairly extensible.
For the most part, each property name is the ``class'' attribute value that
should be added to an (X)HTML element to indicate that it contains the
corresponding data value.

\subsection{Operator Schema Property Attributes}
Each microformat property in the \Op{} schema can have a number of attributes
describing how its data is formatted.  A property with none of these
attributes is assumed to have a single piece of textual data as its value.
Possible attributes include:

\subsubsection*{plural} If true, this attribute indicates that multiple values of
this property are allowed.
\subsubsection*{values} This attribute, if provided, lists the possible
values that the property can take.
\subsubsection*{required} This attribute, if true, indicates that the property
must be specified and cannot be left blank.
\subsubsection*{subproperties} This attribute, if provided, indicates that the
property should be treated as a `group' which contains subproperties rather
than as a property in its own right.  Subproperties are specified using the same
schema style, as the value of the subproperties attribute.
\subsubsection*{datatype} A variety of datatypes can be specified, including
`email', `dateTime' and others, though the most important is a datatype of
`microformat', which indicates that this property contains another
microformat instance.
\subsubsection*{microformat} If the datatype for the property is `microformat',
this is the string name of the microformat contained by this property, such as
`adr' or `hCard'.\\

For example, a subset of the microformat schema for
the \emph{adr} microformat is shown in Listing \ref{lst:adrSchMF}.  An {\em adr}
microformatted item can contain any of eight properties: {\em type,
post-office-box, street-address, extended-address, locality, region,
postal-code}, and {\em country-name}.  The {\em type} and {\em street-address}
properties are ``plural'', meaning they can take multiple values.  Finally,
the {\em type} property is limited to one of the seven different values
listed (``work'', ``home'', etc.).

\begin{lstlisting}[float,frame=tb,caption={Operator schema for the \emph{adr} microformat},label={lst:adrSchMF}]
var adr_definition = {
	"mfVersion": 0.8,
	"mfObject": adr,
	"className": "adr",
	"properties": {
		"type" : {
			"plural": true,
			"values": ["work", "home",
				"pref", "postal",
				"dom", "intl",
				"parcel"]
			},
		"post-office-box" : {},
		"street-address" : { "plural": true },
		"extended-address" : {},
		"locality" : {},
		"region" : {},
		"postal-code" : {},
		"country-name" : {}
	}
};
\end{lstlisting} 
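
The property attributes can be interpreted mechanically.  The following sketch shows one way to do so; the two schema fragments and the helper names \texttt{kindOf} and \texttt{isAllowed} are hypothetical and only illustrate the attributes described in this section.

\begin{lstlisting}[frame=tb]
// Hypothetical schema fragments using the attributes described above.
var typeDef = { plural: true, values: ["work", "home", "pref",
                "postal", "dom", "intl", "parcel"] };
var adrInCard = { plural: true, datatype: "microformat",
                  microformat: "adr" };

// Decide how a property's value is represented.
function kindOf(def) {
  if (def.subproperties) return "group";
  if (def.datatype === "microformat") return "microformat";
  return "value";
}

// True when the value is permitted by the optional values list.
function isAllowed(def, value) {
  return !def.values || def.values.indexOf(value) !== -1;
}

var kinds = [kindOf(typeDef), kindOf(adrInCard)];
var checks = [isAllowed(typeDef, "home"),
              isAllowed(typeDef, "preferred")];
// kinds is ["value", "microformat"]; checks is [true, false]
\end{lstlisting}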

As mentioned, microformats can have properties with microformatted data as
their value; this results in a nested set of (X)HTML container classes representing
all of the data in a hierarchical format.  Figure~\ref{fig:hCard} shows one
legal nesting of properties for the hCard format, including a geographic
location represented by an inline instance of the \emph{geo} microformat, and a few
addresses represented by inline instances of the \emph{adr} microformat.

\begin{figure}[!b] % Here because it puts it in a good spot.
\centering
\includegraphics[width=3in]{hCard}
\caption{A subset of the legal hCard properties}
\label{fig:hCard}
\end{figure}

\subsection{\PR Database Schemas and Collection}
When Firefox loads, a set of database tables is created for each microformat
that does not already have tables in the database.  The database schemas are
dynamically generated entirely from Operator's \JS{} schema objects,
which means that as Operator is extended with new formats \PR is extended as
well.  A single primary table is generated for each microformat, with the
microformat's name as the table name.  This table stores an auto-incrementing
primary key, the time that the microformat was stored, and the URL and host
where it was found.  Storing both the URL and host is redundant but
simplifies later queries.  Additionally, a column is created for each
microformat property that is non-plural and has no subproperties.
Each row in this primary table represents a single microformat instance
found in an (X)HTML webpage.

If a property is plural, or `one-to-many', a separate table is generated
containing the property's values along with foreign-key references back to
the primary table for that format.  These tables are named by combining the table name and the
property name, separated by a hash sign (\#).  Separate tables are also
generated, following the same naming scheme, for any properties that contain
subproperties.  These tables are given a primary key and include a column for each
subproperty that is non-plural and does not itself have subproperties; additional
tables are then created as necessary for plural or subproperty fields by again
appending a hash sign and the new property name.  The table creation process
is recursive and can therefore handle any depth of nested
subproperties.  For the simple `adr' microformat referenced previously, three
tables are created: {\em adr}, {\em adr\#street-address}, and {\em adr\#type}.
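
The table generation rules above can be sketched as a short recursive function.  This is an illustrative reconstruction rather than the \PR source: the column names \texttt{pr\_id} and \texttt{pr\_parent} and the \texttt{value} column of plural tables are assumptions, while the \# naming scheme follows the description above.

\begin{lstlisting}[frame=tb]
// Return CREATE TABLE statements for one microformat schema.
// Identifiers are double-quoted because they may contain '#' or '-'.
function createTables(tableName, properties, parentTable, out) {
  out = out || [];
  var columns = ['"pr_id" INTEGER PRIMARY KEY AUTOINCREMENT'];
  if (parentTable) {
    columns.push('"pr_parent" INTEGER REFERENCES "' +
                 parentTable + '"');
  } else {
    columns.push('"pr_time"', '"pr_url"', '"pr_host"');
  }
  var deferred = [];
  Object.keys(properties).forEach(function (name) {
    var def = properties[name];
    if (def.plural || def.subproperties) deferred.push(name);
    else columns.push('"' + name + '"');
  });
  out.push('CREATE TABLE IF NOT EXISTS "' + tableName + '" (' +
           columns.join(", ") + ")");
  deferred.forEach(function (name) {
    var def = properties[name];
    // A plural property without subproperties gets one value column.
    var childProps = def.subproperties || { value: {} };
    createTables(tableName + "#" + name, childProps, tableName, out);
  });
  return out;
}

var adrProperties = {
  "type": { plural: true, values: ["work", "home", "pref",
            "postal", "dom", "intl", "parcel"] },
  "post-office-box": {}, "street-address": { plural: true },
  "extended-address": {}, "locality": {}, "region": {},
  "postal-code": {}, "country-name": {}
};
var statements = createTables("adr", adrProperties);
// Three statements, for adr, adr#type, and adr#street-address.
\end{lstlisting}

Because the function calls itself for every deferred property, a group nested inside another group simply produces a longer table name.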

During the collection phase, a similarly recursive function handles inserting
all of the required tuples.  A placeholder row is inserted at the beginning
of the collection process, and fields are updated with SQL UPDATE statements.
Inserting the placeholder row allows us to
obtain the primary key for the new row, in case it is needed for foreign key
references during inserts in other tables (such as those prompted by having a
microformat value for a property).  Microformatted property values are stored
as a foreign key reference to the appropriate table for the referenced
microformat, and will be automatically inserted in the appropriate tables if
they have not already been stored.
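
The placeholder-then-update pattern can be illustrated without a database.  In the sketch below, a plain object stands in for SQLite, a row's position in its array plays the role of the primary key, and the names \texttt{insertPlaceholder}, \texttt{store}, \texttt{pr\_parent}, and \texttt{value} are hypothetical.  Properties whose value is another microformat are omitted for brevity.

\begin{lstlisting}[frame=tb]
// In-memory stand-in for the database: table name -> array of rows.
function insertPlaceholder(db, table, fields) {
  db[table] = db[table] || [];
  db[table].push(fields);
  return db[table].length;      // plays the role of the new row id
}

// Store one instance; returns the primary key of its row.
function store(db, table, properties, instance, fields) {
  var id = insertPlaceholder(db, table, fields);
  var row = db[table][id - 1];
  Object.keys(properties).forEach(function (name) {
    var def = properties[name], value = instance[name];
    if (value === undefined) return;
    var child = table + "#" + name;
    if (def.plural || def.subproperties) {
      [].concat(value).forEach(function (item) {
        if (def.subproperties) {
          store(db, child, def.subproperties, item,
                { pr_parent: id });
        } else {
          insertPlaceholder(db, child,
                            { pr_parent: id, value: item });
        }
      });
    } else {
      row[name] = value;        // corresponds to an SQL UPDATE
    }
  });
  return id;
}

var adrProps = { "type": { plural: true },
                 "street-address": { plural: true },
                 "locality": {}, "region": {} };
var db = {};
var id = store(db, "adr", adrProps, {
  "type": ["work"], "street-address": ["747 Howard Street"],
  "locality": "San Francisco", "region": "California"
}, { pr_time: "1197313327", pr_host: "microformats.org",
     pr_url: "http://microformats.org/" });
\end{lstlisting}

The key returned by the placeholder insert is available before any child rows are written, which is what makes the foreign-key references possible.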

\subsubsection{Additional Thoughts}
Because a property with subproperties can itself contain further
subproperties, there is no theoretical limit to the length of a table name
generated in this manner.

There are a number of additional inferences about the schema that could be 
made.  For one, any property declared as `required' could potentially be declared as
``NOT NULL'' in the SQL column definition.  In reality, however, many
`in-the-wild' examples of microformats do not contain all the required
properties, and enforcing such constraints would simply decrease the utility 
of \PR as a collection tool, negating any performance or semantic benefits
provided through such constraints.  Additionally, properties declared with a
set of allowed values could be stored as integers, and treated as a sort of 
enumerated list, but this would prevent a telephone number declared as 
`preferred' instead of `pref' from being stored.  In general, \PR attempts
to be as permissive as possible while maintaining the spirit of the schema.

\subsection{Exceptions}
There are a number of exceptions to the standard microformat rules, due in
part to the informal nature of the microformat specifications and in part to
variability in individual implementations.  Some of these exceptions can
prevent Operator from successfully parsing a microformatted item; others are
parsed correctly but do not fit the database schema described above.  One
concrete example appears on the front page of the Microformats.org website.
Events are listed in hCalendar format, which has a `location' property that
is defined as containing an `adr' microformat.  In this instance, however, the
data is the plain text string ``Berlin'' rather than a marked-up microformat
instance.  While equally informative to the reader, this example breaks from
the schema and can confuse parsing.

The current implementation of \PR handles this case by prepending a
single non-numeric character, serving as a `textual identifier', to the value
and storing the result in the column.  The column thus serves a dual purpose, holding
either a foreign key or a text value.  Because SQLite does not impose strict types
on individual columns, all columns are treated as text, and this strategy
is not prevented by the DBMS.  While effective, this strategy is
clearly suboptimal.  A number of modifications to the schema could be made
to more effectively store the collected data following a relational model.
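
A minimal sketch of this dual-use encoding follows.  The marker character \texttt{t} and the function names are assumptions chosen for illustration; any single non-numeric character would serve.

\begin{lstlisting}[frame=tb]
// Hypothetical marker; it must not be a digit, so that it can never
// be confused with the first character of a row id.
var TEXT_MARKER = "t";

// Encode a column value that is either a foreign key (a row id) or
// free text that appeared where a microformat was expected.
function encodeReference(value) {
  return typeof value === "number" ? String(value)
                                   : TEXT_MARKER + value;
}

// Recover the kind of value and its content from the stored string.
function decodeReference(stored) {
  if (stored.charAt(0) === TEXT_MARKER) {
    return { kind: "text", value: stored.slice(1) };
  }
  return { kind: "key", value: parseInt(stored, 10) };
}

var storedText = encodeReference("Berlin");   // "tBerlin"
var storedKey = encodeReference(42);          // "42"
\end{lstlisting}

Every reader of the column must apply the decoding step, which is one reason the approach is awkward from a relational point of view.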

We hope that \PR can be used to help identify some of the variability in
currently deployed microformats, and the database schema can continue to
evolve and encompass as many `edge-cases' as possible.


\section{Retrieval}

Since storing microformats is only useful if they can be retrieved later, we implemented an initial database browsing interface using the \Ex{} web application\cite{exhibit}.  This requires two main components.  First, since \Ex{} uses yet another general data model, \PR must convert the relational data tables into the types of objects \Ex{} expects.  Second, we format the \Ex{} ``lens'' display to be appropriate for the microformats being viewed.

\subsection{Exporting to Exhibit's data model}

\begin{figure}[!b]
\centering
\includegraphics[width=5in]{screenExport}
\caption{\PR displaying collected hCard data in an Exhibit, with export formats displayed}
\label{fig:export}
\end{figure}

\Ex{} expects to read in a JSON data file which specifies a directed graph of data objects.  The details are described in \cite{exhibitdata}; for the purposes of this report, we provide an example of a citation data object:

\begin{lstlisting}[frame=tb,caption={JSON data for a document citation},label={lst:citMF}]
{
  "items" : [
  {
    "type": "Publication",
    "label": "XStream: A Signal-Oriented Data Stream Management System",
    "id": "0",
    "booktitle": "Proceedings of ICDE",
    "pub-type": "inproceedings",
    "year": "2008",
    "month": "",
    "author": [
      "Lewis Girod",
      "Yuan Mei",
      "Ryan Newton",
      "Stanislav Rost",
      "Arvind Thiagarajan",
      "Hari Balakrishnan",
      "Samuel Madden"
    ]
  },
  /* ... more items ... */
  ]
}
\end{lstlisting}

When a user selects the ``Create an Exhibit'' menu item (added by \PR to the Operator toolbar), \PR generates a new JSON file which contains all of the stored microformat data of the type specified.  It does this in a manner similar to our microformat collection code, traversing Operator's microformat schema to determine the relevant relational data tables.  The table schema is strict enough that there is a one-to-one mapping between the layout of the tables and the format of the JSON items.  An example item entry for the \emph{adr} \MF{} defined in Listing~\ref{lst:adrSchMF} might look like the following:
\begin{lstlisting}[frame=tb,caption={JSON data generated by PackRat},label={lst:adrMF}]
{
  "type" : "adr",
  "label" : "adr_1",
  "pr_time" : "1197313327",
  "pr_host" : "microformats.org",
  "pr_url" : "http://microformats.org/",
  "street-address" : [
    "747 Howard Street"
  ],
  "locality" : "San Francisco",
  "region" : "California"
}
\end{lstlisting}

We also generate a generic ``schema'' JSON file that gives \Ex{} further information about the properties defined in these data objects.
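
The mapping from table rows to \Ex{} items can be sketched as follows.  This is an illustration, not the \PR export code: the function \texttt{toExhibitItem}, the \texttt{pr\_id} column, and the shape of the \texttt{plural} argument are assumptions, while the \texttt{type}, \texttt{label}, and \texttt{pr\_} fields mirror the generated item shown above.

\begin{lstlisting}[frame=tb]
// Build one Exhibit item from a primary-table row plus the values
// gathered from its plural-property tables.
function toExhibitItem(type, row, pluralValues) {
  var item = { type: type, label: type + "_" + row.pr_id };
  Object.keys(row).forEach(function (column) {
    if (column === "pr_id") return;
    var value = row[column];
    // Empty columns are left out of the exported item.
    if (value !== null && value !== undefined) item[column] = value;
  });
  Object.keys(pluralValues).forEach(function (name) {
    if (pluralValues[name].length > 0) {
      item[name] = pluralValues[name];
    }
  });
  return item;
}

var row = { pr_id: 1, pr_time: "1197313327",
            pr_host: "microformats.org",
            pr_url: "http://microformats.org/",
            "post-office-box": null,
            locality: "San Francisco", region: "California" };
var plural = { "type": [],
               "street-address": ["747 Howard Street"] };
var item = toExhibitItem("adr", row, plural);
var json = JSON.stringify({ items: [item] }, null, 2);
\end{lstlisting}

Serializing with \texttt{JSON.stringify} avoids hand-built strings and therefore always yields well-formed output.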


\subsection{Exhibit formatting}

Each \MF{} type can have its own \Ex{} schema and layout files.  These are automatically downloaded from a server if \PR detects that they are missing.  As the system currently stands, the locations of the schema and (X)HTML files must be hard-coded into the \PR extension file.  Where no schema and (X)HTML files exist for the \Ex{} being created, default files are used which display arbitrary microformats reasonably well.  Thus, new \MFs{} can be viewed with \PR immediately, even if the microformat authors have not designed a format file to go along with them.


% ------------------------------
\section{Evaluation}

To test whether \PR would automatically work with new microformats not originally specified in Operator, we created a simple \MF{} of our own to formalize bibliographic citations, which we called \emph{hCite} (after the popular hCard and hCalendar microformats).  \emph{hCite} contains all of the fields commonly found in publication listings such as \cite{maddenpubs}.  Defining this new microformat only required creating a new Operator schema definition for it.  To test recognition and collection of the microformat, we slightly modified \cite{maddenpubs} to include the correct ``class'' attribute labels.  As planned, \PR correctly created the appropriate tables for this new microformat type, collected the citations from \cite{maddenpubs}, and exported them to a new default-formatted Exhibit interface.  In particular, no modifications to \PR itself were necessary.

We ran an initial performance evaluation of \PR by measuring its speed in exporting hCard microformatted items to the browsing interface (Exhibit).  This export requires scanning the main hCard table, following foreign key references into the supplementary hCard tables, and assembling the results into a JSON file suitable for Exhibit.  The code for doing this is written in \JS{} and was run on a 2.8 GHz Intel Core Duo machine.  The results are displayed in Figure~\ref{fig:eval}.  These results represent worst-case performance because the entire data set is being exported.  Since the data store is built on a relational database system, operations such as indexed lookups can be done in time that grows only logarithmically with the size of the data set.  Achieving this performance in practice will require tuning the browsing interface to request only the items that are currently visible at any given time; this was far beyond the scope of our project.
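
As a sketch of what requesting only the visible items might involve, the function below builds a paged query using SQLite's LIMIT and OFFSET clauses.  The function name \texttt{pageQuery} and the \texttt{pr\_id} ordering column are assumptions, and the returned statement would still have to be executed through the storage API.

\begin{lstlisting}[frame=tb]
// Build a query for one page of results.  The table identifier is
// double-quoted because PackRat table names can contain '#'; any
// embedded double quote is doubled, as SQL requires.
function pageQuery(table, pageSize, pageIndex) {
  return {
    sql: 'SELECT * FROM "' + table.replace(/"/g, '""') +
         '" ORDER BY "pr_id" LIMIT ?1 OFFSET ?2',
    params: [pageSize, pageSize * pageIndex]
  };
}

var third = pageQuery("hCard", 25, 2);
// third.params is [25, 50]: rows 51 through 75 in key order.
\end{lstlisting}

With a large offset SQLite still steps over the skipped rows, so a condition on the last key already displayed is a common alternative for deep pages.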

\begin{figure}[!b]
\centering
\includegraphics[width=4.0in]{export_time.pdf}
\caption{Time to export stored microformats to the browser interface as a function of database size.}
\label{fig:eval}
\end{figure}


% ------------------------------
\section{Future Work}


The interactions the user has with the \MFs{} stored in the database are currently limited to creating an \Ex{} with all of the stored data of a specific \MF{} type. Moreover, the only way to create \Exs{} is to navigate to a page that has an instance of the desired \MF{} and use the highlighted menu to initiate its creation. This is clearly counter-intuitive since it implies some sort of direct relationship between the content of the webpage and the generated \Ex{} (and requires browsing to a page containing the right \MF{} type). Other desirable additions include:

\begin{itemize}
	\item Creating an \Ex{} with items from the current page only.
	\item Creating an \Ex{} with items that fulfill a certain requirement (e.g. people whose last names start with X).
	\item Allowing arbitrary selection queries against the \PR database.
	\item Ability to selectively remove items from the \PR database.
	\item An interface allowing users to more easily specify custom \Ex{} formatting files for specific microformat types.
\end{itemize}

In addition, there are a few attributes of properties defined by Operator that
\PR does not currently understand, such as \texttt{microformat\_property}, which indicates
that the contents of the (X)HTML container should be treated as representing
a specific property of a sub-microformat instead of a whole new microformat
instance.  As detailed above, PackRat's handling of misformatted microformats
could also be extended.

\section{Conclusion}

\PR as currently implemented is usable with up to about 300 \MFs{} of any given type.  However, interface optimizations would allow it to scale far beyond that limit.  As the project currently stands, it is a good first step both in bringing a new formalization to the \MF{} community and in making a widely used (and growing) portion of the semantic web more accessible and useful to casual web surfers.


%\section{Key Points}
%Key project features of interest for the final presentation on 11 December 2007.
%\begin{itemize}
%\item Formal DB schemas from informal microformat definitions.
%\item Dual-usage of columns as both foreign key, and data storage. 
%\end{itemize}

\begin{thebibliography}{99}

  \bibitem{swig} W3C Semantic Web Interest Group. \emph{\url{http://www.w3.org/2001/sw/interest/}},
  visited December 2007.

  \bibitem{microformats} Microformats.org. \emph{\url{http://www.microformats.org}},
  visited December 2007.

  \bibitem{asphalt} National Asphalt Pavement Association \textbar{} 2008 Annual Meeting Agenda (Tentative). \emph{\url{http://www.hotmix.org/calendar/icalendar/icalagenda.php?Calendar=AnnMeetAgenda}},
  visited December 2007.

  \bibitem{whitepages} Whitepages.com. \emph{\url{http://www.whitepages.com}},
  visited December 2007.

  \bibitem{golfdigest} Course Finder: Abacoa Golf Club: golfdigest.com. \emph{\url{http://www.golfdigest.com/courses/places/2483}},
  visited December 2007.
  
  \bibitem{maddenpubs} Sam Madden's Publications. \emph{\url{http://db.lcs.mit.edu/madden/pubs.php}},
  visited December 2007.

  \bibitem{operator} Mike's Musings $>>$ Operator. \emph{\url{http://www.kaply.com/weblog/operator/}},
  visited December 2007.

  \bibitem{mozStorage} Storage -- MDC.  \emph{\url{http://developer.mozilla.org/en/docs/Storage}},
  visited December 2007.

  \bibitem{exhibit} Exhibit -- SIMILE. \emph{\url{http://simile.mit.edu/wiki/Exhibit}},
  visited December 2007.
  
  \bibitem{exhibitdata} Understanding Exhibit Databases. \emph{\url{http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_Database}},
  visited December 2007.
  
  
  


\end{thebibliography}
%%%%%%% OTHER THINGS THAT MAY WANT TO BE BIBLIOFIED.  %%%%%%%%
%Microformats
%http://microformats.org/wiki/Main_Page
%http://www.xfront.com/microformats/index.html
%http://kitchen.technorati.com/search
%
%hReview
%
%http://microformats.org/wiki/hreview
%http://microformats.org/code/hreview/creator
%http://microformats.org/wiki/hreview-implementations
%http://microformats.org/wiki/hreview-examples-in-wild
%
%Operator
%http://www.kaply.com/weblog/operator-user-scripts/
%
%Firefox
%
%http://developer.mozilla.org/en/docs/Storage
%http://developer.mozilla.org/en/docs/mozIStorageConnection
%http://developer.mozilla.org/en/docs/mozIStorageValueArray
%http://developer.mozilla.org/en/docs/nsILocalFile
%http://www.xulplanet.com/tutorials/xultu/chromeurl.html
%http://www.borngeek.com/firefox/toolbar-tutorial/
%
%Exhibit
%
%http://simile.mit.edu/wiki/Exhibit
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
