ExtractingTopicDataToADatabase
--
AlanBurlison - 20 May 2004
Summary
I've been using TWiki for a workflow-type application, and I wanted to create some summary report pages to track progress, look for missing information and spot errors. I also needed to be able to extract key information (e.g filenames and status) for processing in another application, and although there are plugins that give database-like access to the TWiki topic data (e.g.
ADatabaseCalledTWiki) they crucially
don't allow external applications to access the TWiki data using
SQL. I rapidly hit the limits of the TWiki
%SEARCH{...}% mechanism, especially when dealing with repeating groups in a topic (e.g. lists of files), so I ended up extracting the key information into a set of tables in a
SQLite
database using the perl
DBI
interface. This was really remarkably easy to do, and allowed me to combine the unstructured-but-flexible nature of a Wiki with the structured-but-inflexible attributes of a database. I know there's been much discussion of the interplay between TWiki and databases, so I thought I'd write up my experiences in the hope that they'd help others and stimulate some discussion.
Related topics
WorkFlow,
WorkFlowAddOn,
TwikiMetadata,
DatabaseForMETA,
MetadataSucks,
PluginDataStore,
TWikiWhatWillYouBeWhenYouGrowUp,
PerlDBI
Example topic data
I used forms and templates along with a series of customised 'create a new topic' forms to impement 'typing' of topics. One of the topic types that I wanted to query against contained a list of filenames, and I wanted to extract each filename into a seperate row in a datbase. The raw topic text looks something like:
---+ <nop>%TOPIC%
---++ Owner
Fred Bloggs
---++ Summary
This topic blah, blah, blah....
---++ Files
/for/bar/baz %BR%
/some/other/file %BR%
/yet/another/file %BR%
/and/so/on/and/so/forth %BR%
Database mapping of topic data
For the particular topic type shown in the example above, the web name, topic name, the owner, summary block text, the topic metadata and the form data would appear as one row in a table (
table1), and for each filename a row containing (
web,
topic,
filename) would appear in a detail table (
table2). There are several other topic types which map onto different table schemas, depending on the contents and the presence of repeating groups etc. All the various tables use as a minumum (
web,
topic) as the primary keys.
Query interface
I'll show the query interface first, then explain how I got the data into a database. I tried to keep the syntax of the queries as close as possible to that of
%{SEARCH{..}% variable. A typical query embedded in a page looks like this:
%QUERY{"select t1.owner, t1.topic, t2.filename from table1 t1, table2 t2 where t1.web = 'Myweb' and t1.web = t2.web and t1.topic
= t2.topic order by 1, 2, 3" break="1" header="---+++ Owner: #1\n| <nop>*Topic* | <nop>*Filename* |" format="| #2 | #3 |"}}%
The
query attribute specifies the
SQL to be used to query the database, the
break attribute specifies the report break column (if required), the
header attribute specified the header to be used on the table/break block, and the
format attribute allows you to specify the format for each table row. In the
header and
format attributes each
#n is replaced with the appropriate column from the query result. The query in the example above would expand to something along the lines of:
Owner: Fred Bloggs
| Topic |
Filename |
| TopicOne |
/for/bar/baz |
| TopicOne |
/some/other/file |
| TopicOne |
/yet/another/file |
| TopicOne |
/and/so/on/and/so/forth |
| TopicTwo |
/somewhere/else |
| TopicTwo |
/dev/null |
Owner: Peter Thoeny
And so on.
Data extraction from TWiki
This is done using a custom plugin, and by providing a
beforeSaveHandler that looks at the form type of the topic and passes control to the appropriate per-topic-type handler, e.g.
if ($_[0] =~ m{%META:FORM{name="TypeOneForm"}%}) {
scan_type_one(\$_[0], $Topic);
} elsif ($_[0] =~ m{%META:FORM{name="TypeTwoForm"}%}) {
scan_type_two(\$_[0], $Topic);
}
The individual type handlers then delete any existing rows for the topic from the appropriate tables, parse the topic text and insert the new rows into the tables.
Query rendering
The query rendering is done in a
startRenderingHandler. The code is relatively simple, the bulk of it being concerned with parsing the query parameters.
Other issues
The only wrinkle is that TWiki doesn't provide a 'rename topic' handler, so if a topic is deleted the rows still appear in the database. I worked around this by writing a cron job that once a day scans all the topics in the Web, rebuilds the tables and deletes the rows for topics that no longer exist.
Alan, this is very interesting. Is this plugin (plus some sample code interacting with the
SQL database) something you could share with the wider community?
--
ThomasWeigert - 20 May 2004
The plugin has lots of proprietary goop in it so I can't share the whole thing, but I can post the bits that do the database stuff. Unfortunately it's not completely generic as it knows the layout of the various template topics, but some of the stuff (e.g. the query handler) could be easily resused. I'll attach the stuff to this topic later on. --
AlanBurlison - 20 May 2004 - 11:34
--
AlanBurlison - 20 May 2004
You may want to have a look at
FormQueryPlugin, which caches form data in a database for rapid queries using a cut-down
SQL.
--
CrawfordCurrie - 20 May 2004
Thanks, and I had looked at it, but it is a) limited to form data, b) isn't full
SQL and c) doesn't allow for the data to be queried outside of TWiki, which are my primary requirements.
--
AlanBurlison - 20 May 2004
OK. Just one clarification; the
FormQueryPlugin is
not limited to form data. It reads structured tables from the topic, and also stores the topic text. Points (b) and (c) are both good ones! This is a great idea, and I'm interested in finding out more, especially how you organised the database. I have wanted to see the TWiki Store implemented in a database (as an alternative to the text form) for some time. Caching the text DB into an
SQL database on the fly is an intriguing idea that had not occurred to me.
- What DB did you use?
- How did you cope with new form types being added?
- And how did you do the SQL? Did you parse the query in the perl (e.g. via SQL::Statement), or throw it straight at the DB?
- If in the perl, did you use one of the CPAN SQL parsers?
(Just FYI,
SharedCode includes a standard parameter parser module -
Attrs.pm - that might have saved you some code, though it wouldn't help with parsing
SQL)
- Do you have any comments on the performance of the
beforeSaveHandler?
- Does the %QUERY writer have access to queries already cached in the DB?
--
CrawfordCurrie - 21 May 2004
This looks very similar to the latest setup I have at my company.
I wrote a Plugin that is a wrapper around SQLite using Perl DBI (basically to reduce some common code to a single function call), and another Plugin to store form metadata in the database.
What I did was to create (dynamically) one table per form. A sample DDL is as follows:
CREATE TABLE TicketLifecycleForm (EntryType TEXT,Summary TEXT,TicketPriority TEXT,AffectsRelease TEXT,Description TEXT,Date TEXT,web TEXT, topic TEXT);
Also, an index/PK over
(web,topic) is created.
The data is extracted and saved on the
afterSaveHandler handler. I did this because the META info may be changed on the
beforeSaveHandler handler. If the table does not exist, it is created. The plugin also provides a
rest action to process a whole web.
My current implementation has some limitations:
- It is not possible to two forms with the same name and different structure on different webs.
- If the topic has some FORMFIELDs that is not in the form definition, the insert will fail.
- The plugin is not managing modifications in the structure of the Form.
If someone is interested I can take time to clean up the code a little and publish them to
SVN.
--
RafaelAlvarez - 18 Aug 2008