Spreadsheets contain valuable data on many topics, but
they are difficult to integrate with other sources. Convert-
ing spreadsheet data to the relational model would allow
relational integration tools to be used, but using manual
methods to do this requires large amounts of work for each
integration candidate. Automatic data extraction would be
useful, but it is very challenging: spreadsheet designs generally
require human knowledge to understand the metadata
being described. Even if this metadata can be obtained
automatically, a single mistake can yield
an output relation with a huge number of incorrect tuples.
We propose a two-phase semiautomatic system that ex-
tracts accurate relational metadata while minimizing user
effort. Based on conditional random fields (CRFs), our
system enables downstream spreadsheet integration applica-
tions. First, the automatic extractor uses hints from spread-
sheets’ graphical style and recovered metadata to extract
the spreadsheet data as accurately as possible. Second, the
interactive repair component identifies similar regions in dis-
tinct spreadsheets scattered across large spreadsheet cor-
pora, allowing a user’s single manual repair to be amortized
over many possible extraction errors. By integrating the
repair workflow into the extraction system, a human can
obtain an accurate extraction with just 31% of the manual
operations required by a standard classification-based
technique. We demonstrate and evaluate our system
using two corpora: more than 1,000 spreadsheets published
by the US government and more than 400,000 spreadsheets
downloaded from the Web.
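
To make the idea of amortizing a single manual repair concrete, the following is a minimal sketch and not the system's actual method. It assumes a toy setting in which each extracted region carries a hypothetical style signature and a predicted label, and it treats two regions as "similar" only when both match exactly; the CRF-based extractor and the real similarity criterion are not modeled. The names Region, repairs_without_sharing, and repairs_with_sharing, as well as the labels "data", "header", and "title", are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical extracted region (for example, one row of one sheet)."""
    sheet: str
    signature: tuple  # assumed style features, e.g. (indent level, is bold)
    predicted: str    # label produced by the automatic extractor
    truth: str        # correct label, which only the user knows


def repairs_without_sharing(regions):
    """Baseline: the user fixes every wrong label one at a time."""
    return sum(1 for r in regions if r.predicted != r.truth)


def repairs_with_sharing(regions):
    """Review regions in order and copy each manual fix to later lookalikes.

    Two regions count as lookalikes when they share both the style
    signature and the extractor's original prediction (a toy criterion).
    Returns the number of manual operations and the final labels.
    """
    labels = [r.predicted for r in regions]
    groups = defaultdict(list)
    for i, r in enumerate(regions):
        groups[(r.signature, r.predicted)].append(i)

    manual = 0
    for i, r in enumerate(regions):
        if labels[i] != r.truth:
            manual += 1  # one manual repair by the user
            # Propagate the fix to this region and to every lookalike
            # that has not been reviewed yet.
            for j in groups[(r.signature, r.predicted)]:
                if j >= i:
                    labels[j] = r.truth
    return manual, labels


# Toy corpus: three sheets share the same mistake, one sheet has its own.
regions = [
    Region("a.xls", (1, False), "data", "header"),
    Region("b.xls", (1, False), "data", "header"),
    Region("c.xls", (1, False), "data", "header"),
    Region("c.xls", (0, True), "header", "header"),
    Region("d.xls", (2, False), "data", "title"),
]

baseline_ops = repairs_without_sharing(regions)
shared_ops, final_labels = repairs_with_sharing(regions)
```

On this toy corpus the baseline needs four manual repairs while the shared variant needs two, because the first fix is reused for the two sheets that repeat the same mistake. The numbers are a property of the made-up data and say nothing about the 31% figure reported above.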