Tags:
create new tag
, view all tags

Report Plugin Idea

In FormQueryPlugin and DBCacheContrib there is an implementation of handling structured data in TWiki. All my ideas are basically stolen from there. If you are looking for a working solution, look there. But please tell me about your requirements (use the comment box at the end of this topic)!

I try to propose a more complete and general way of handling structured data in TWiki.

Unfortunately I'm not a Perl programmer and don't have much time in the near future to implement this. But I'm starting to share my thoughts. If I get some good feedback from you, I hope there's something getting out that will be worth learning Perl and spending my time smile I expect that some things have to get much simpler first...

And feel free to turn my Pidgin English to real phrases.

Where I Don't Want To Go

No TWiki Data in a Database (like MySQL).

I chose TWiki among other things because it stores data in textfiles. A Database would make TWiki harder to install and maintain and more dependent of certain technologies.

Files are always the easiest way to store Data. You can use all of your favorite editors and file-managing tools.

No Re-Invention of SQL

On the one hand because it's to complicated and would be never as good as already existing database tools. But more than this, SQL just sucks. It's wannabe-easy but very hard for ordinary people to learn. And It's much to powerful for this application.

Data Structure

All is based on tables and links. Tables are uniformed data structures. Links are relations or references allowing one to jump from one table to another.

The diagrams show the structure like it appears for the user. I does not mean that the data is stored like that in the cache. More about the cache later.

Basic Structure

Basic Structure

The basic structure covers the webs, topics, revisions, attachments and users. The textlines are just a syntactical trick to query multiple lines in the same topic.

There could be more, like groups, offices, plugins and so on. I think it's complex enough to start...

There is the option to have the structured data from all the former revisions. (Is this of much use? It would be possible to query "who changed when whitch field to what value" ...) For the moment, the topic link of the revisions table points always to the current topic, not the topic of the revision.

Web preferences are part of the dynamic structure, if it is definied in forms.

Dynamic Structure

Dynamic Structure

Forms and embedded tables are dynamically created tables. Embedded tables must be formatted using the EditTablePlugin, definied in a separate topic. This topic is the definition topic of this table and has to be declared in the ReportPlugins konfiguration.

There are predefined links to the topic they are stored in, but nothing more.

The basic strucure could be enhanced by defining user-defined links. This links belong to a user-defined table (TWikiForm or EmbeddedTable) and relate to any other table. For instance, is there a topicname in a formfield, a link could be definied to this topic. This is done with %LINK{...}%.

Meta Structure

Metadata

Because of its extensibility, metadata should be available in tables and fields. There is currently no idea about a syntax that allows one to access data of a field which name is taken from another field.

Links

While accessing the data one could always follow the links to another table. Every link is a reference in the table which owns the link. A link points always to exactly one row in the foreing table.

One could not follow a link backward. This is because the backward relation is always on-to-many and thus ambiguous. There is the possibility with Nested Reports to list foreign tables that have a "many" relation.

The Cache

There should be a chache like the DBChacheContrib. I propose only to put the structured parts (metadata, forms and embedded tables) into the cache. The full text could be taken directly from the files. So the fields text, summary, textline.text should not be in the chache.

Functions

There are ordinary functions and aggregat functions.

Ordinary functions

func comment
format_time(time, format) Format a time field to a string. Formatstring like 'yyyy-mm-dd hh:mi:ss'.
format_num(number, format) Format a number to a string. Formatstring like '###\'###\'##0.00' .
to_time(string, format) Evaluates a string and turns it to a internal time value (to compare time fields with constants or user-input)
to_num(string, format) Evaluates a string and turns it to a number. (necessary?)
calc(...) Everything between the brackets is passed to the SpreadSheetPlugin, if installed. Data fields are passed as parameter, there should be a explicit syntax for that. I have to check if there is no syntactical problem.
? ...

Aggregat Functions

func comment
sum(field)  
min(field)  
max(field)  
avg(field) average
=any(field)  
count Count rows in the group
distinct(field) Count distinct occurrences of a value. (Or list of values?)
list(field) List all distinct values in the group and separates it with ", "

Queries

Actually queries are mostly intended to group and filter data several times. Simple things could directly be made in a report. I would like to have Queries definied in the plugins topic to be chached, while others are not.

Queries result always in a tabular structure. There is always only one source for a query. Because of the links it is not necessery to have more than one (believe me!)

Data could be grouped (like GROUP BY in SQL). Grouping allows the use of Aggregate Functions. Links that are not grouped by are lost.

Grouping generates implicitly a query named queryname_detail. The detail query is composed of a link to the original row called row and a link to the group called group.

Reports

Reports are used to print the data. Reports could be nested. You could define a Query ad hoc when defining a report.


Syntax Proposition

Bold parameters are not optional.

%QUERY{...}% Statement

Parameters for %QUERY%:

param comment
name Name of the query.
source The data source: one table (topicinfo, attachments, revinfo, Formname or EmbeddedTableName) or another query
first_row First record to display
row_count Print at most row_count records.
filter a filterexpression. Format: "field='value'" to filter to a certain value. Instead of = use ~ (regex), >, <, >=, <=, =.
groupby fields to goup by or onerow to group fields to one row.
calculate a list of regular functions to evaluate and turn to regular columns. Format: "expression as columnname, expression as columnname ... " (as columnname is optional)
aggregate a list of aggregate functions to evaluate and turn to regular columns. Format like calculate.

  • Filtering is always done before grouping. If you want to filter the result of the group, you have to define another query.
  • calculate is always done before grouping. If you want to calculate with grouped values, use aggregate instead.
  • aggregate is always done while grouping. If you want to group or filter by aggregated values, you also have to define another query.
  • Queries are not ordered. Sorting is done in a %REPORT%

Questions

  • I don't know how flexible filter will be. Is "function(field)>field" possible?
  • What about datatypes? What values are compared as string or as number? (Same for sorting)

%REPORT{...}% Statement

There is the %REPORT{...}% and %ENDREPORT% statement, between which the fields of the query are available. An inner nested part refers to the surounding one. There must be a link between the source of the inner and outer report part.

Parameters for %REPORT{...}%:

param comment
name Name of the report part to refer to.
source like in %QUERY%
first_row like in %QUERY%
row_count like in %QUERY%
filter like in %QUERY%
calculate like in %QUERY%
groupby like in %QUERY%
aggregate like in %QUERY%
order a list of fields to sort by. Use "-Fieldname" for descending order
link The links to use to join the source of the outer report part, if is should be linked

%FIELD{...}% Statement

Between %REPORT{}% and %ENDREPORT%, fields are accessed simply using %FIELD{"fieldname"}%.

Parameters for %FIELD{"fieldname"}%:

param comment
"fieldname" Name of the field.

A "fieldname" is composed as name, link.name, link.link.name etc. reportname:name, reportname:link.name and so forth. (creation.time, parent.current.user.name, parent.Product.Producer, myreport:topic.name).

%LINK{...}% Statement

Parameters for %LINK{...}%:

param comment
name The name of the link. With this name the target table could be reached from the source table
table the data source: a userdefinied table (Formname or EmbeddedTableName)
target the link target, any table.
fields List of fields in the source table pointing to one of the target records, corresponding to targetkey.
targetkey List of fields in the target table that uniquely define a record in the target table


Examples

Topics and Revisions

Get a list of topics which contain SEARCH. Render it with a list of all contributers.

%REPORT{name="topicpart" source="topics" filter="text~'.*SEARCH.*'" order="creaton.time"}%
---++ %FIELD{"name"}%

| revision          | (%FIELD{"current.revision"}% |
| author            | (%FIELD{"current.user.name"}% |
| revision time     | (%FIELD{"format_time(current.time, 'dd.mm.yyyy hh:mi:ss')"}% |

---+++ Former Revisions and Contributers
%REPORT{name="revpart" source="revisions" link="topic" order="revision" filter="revision < topicpart:current.revision"}%
| *rev*                     | *user*                    | *date* |
| %FIELD{"revision"}% | %FIELD{"user.name"}% | %FIELD{"format_time(time, 'dd.mm.yyyy hh:mi:ss')"}% |

%ENDREPORT%
_End of %FIELD{"name"}% revisions._

%ENDREPORT%

Recent Changes in all public webs

Similar to the WebChanges. (I'm not sure that this %URLPARAM{...} could work...)

%REPORT{source="webs" order="name" filter="public='1'"}%

---+ Recent Changes in %FIELD{"name"}% - Web

%REPORT{source="topics" link="web" order="-time" row_count="%URLPARAM{\"rows\" default=\"50\"}%"}%

---++ %FIELD{"name"}% - %FIELD{"format_time(current.time, 'dd.mm.yyyy hh:mi:ss')"}% - %FIELD{"current.user.name"}

%FIELD{"summary"}%

%ENDREPORT%
%ENDREPORT%

See
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=50][50]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=100][100]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=200][200]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=400][400]]
most recent changes.

Large Attachments

Gets a list of all attchments larger then 1MB.

| *filename*       | *size*         | *topic*             | *uploaded from*   | *upload time* |
%REPORT{source="attachments" order="-size" filter="size > '1000'"}%
| %FIELD{"name"}%    | %FIELD{"size"}%  | %FIELD{"topic.name"}% | %FIELD{"user.name"}% | %FIELD{"format_time(time, 'dd.mm.yyyy hh:mi:ss')"}% |
%ENDREPORT%

Statistics

Produces statistics like the WebStatistics, without "views", because "views" are not available. (Could be done if the log were a embedded table.)

Group by month to get monthly revisions.

%QUERY{name="monthly_revs" source="revisions" calculate="format_time(time, 'yyyy-mm) as month" groupby="month" aggregate="count"}%
Group the implicitly generated detail query by user and group (this is the link to monthly_refs). This results in a row per user and month.
%QUERY{name="monthly_user_contibutes" source="monthly_revs_detail" groupby="group, row.user.name as username" aggregate="count" }%

<TABLE>
<TR>
  <TH>Month</TH>
  <TH>Topic Changes</TH>
  <TH>Top Contributors</TH>
</TR>
%REPORT{source="monthly_revs" order="-month" }%
<TR>
  <TD>%FIELD{"month"}%</TD>
  <TD>%FIELD{"changes"}%</TD>
  <TD>%REPORT{source="monthly_user_contibutes" link="group" orderby="-count" row_count="10" }%
   * %FIELD{"count"}% %FIELD{"username"}%
  %ENDREPORT% </TD>
</TR>
%ENDREPORT%
</TABLE>

Top Ten Contributers

Counts only the new revisions a contributer has made.

%QUERY{name="contributions" source="revisions" groupby="user" calculate="count, min(time) as min_time, max(time) as max_time" }%

| *name*  | *contributions*  | *between*  |
%REPORT{source="contributions" order="-count" row_count="10"}%
| %FIELD{"user.name"}%  | %FIELD{"count"}%  | %FIELD{"format_time(min_time, 'mm.yyyy')"}% - %FIELD{"format_time(max_time, 'mm.yyyy'"}% |
%ENDREPORT%

My Contributions

A List of every topic I ever contributed in.

%REPORT{source="revisions" filter="user.name='%USERNAME%'" order="topic.name"}%
   * %FILED{topic.name}% rev. %FIELD{"revision"}% - %FIELD{"time"}%
     (current rev. %FIELD{"topic.current.revision"}% - %FIELD{"topic.current.time"}%)
%ENDREPORT%

-- StefanSteinegger - 21 Mar 2005


Discussion

Stefan, I am not able to give you a thorough design review due to lack of time but I think what you are trying to do is very worthwhile.

I have one idea to make this a little easier: Why don't you

  1. generate a set of tables (as you describe above) in the cache and then
  2. use DatabasePlugin (or similar) to query that cache.
I realize you said you don't like SQL, but this would give you a quicker and standard way to achieve your goal, and it would make it possible for people to use either TWiki or an underlying database interchangeably.

-- ThomasWeigert - 23 Mar 2005

Thanks for your feedback Thomas. It's not the problem, that I dont like SQL, I know it inside out. But I don't want that TWiki users have to cope with.

Secondly I dont want that the Plugin (and TWiki at all) is too complicated to install, because one have to install a Database to run it cleanly. Then you start to question why there are textfiles to store topics, if you need a database anyway. But this is just a cache, original data is still stored in the topics.

But it's possibly the quicker way to get any results. I'll think about it. There is certainly a driver for a database in a textfile or something.

-- StefanSteinegger - 23 Mar 2005

It's re-inventing the wheel a bit (FormQueryPlugin already implements much of the above) though it has some interesting ideas, and I guess I need to comment.

It has been my intention for some time to re-do the FormQueryPlugin, treating it as a prototype. But my thoughts were taking me in a couple of different directions:

  1. I want the core to implement a TopicObjectModel - basically do what the DBCacheContrib currently does, but within the core.
  2. I want to provide a variety of query interfaces to this data - FormQueryPlugin expresses one, this idea expresses another. The one I am most interested in leverages the CPAN SQL parser to implement a subset of SQL.
  3. Thomas's idea of using a DB as a cache is an interesting one. Any thoughts about performance (populating the cache)?
  4. To me, text is just another field in the data model. I want to be able to plug in different "parsers" that extract structured data (such as TWiki tables, headings, section tags etc.) from topic text and incoporate them in the data model.
If there is synergy with your thinking, Stefan and Thomas, I'd be delighted to work with you on this...

-- CrawfordCurrie - 23 Mar 2005

I hoped that you comment my idea, Crawford. I like your FormQueryPlugin very much. As I already mentioned: my ReportPluginIdea is basically the FormQueryPlugin, with some differences

Differences to the FormQueryPlugin

  • The data structure of the FormQueryPlugin is tree-like. So you have to "flatten down" the data, if you want to query branches like revisions or attachments using extract. It's the power of relational databases that this is not necessery, because there is no root. You can start querying wherever you need. It should be business of the report to build trees, not of the model.
  • I introduced a group by feature, this is very powerful.
  • I never liked the format parameters in %SEARCH{...}% and similar functions very much. It's ok to format a single line. You can write nested %SEARCH{...}%'s, using escape characters. This is actually not readable for human beings after the third level smile I'm not sure if FormQueryPlugin doesn't allow nesting %SHOWQUERY{...}% or if it suffers the same syntactical problem. So I think that %REPORT{...}% and %ENDREPORT% could be incredibly cool.
  • I use uniformed tables, instead of hashes.

Did you notice how easy it is to produce this queries and reports? Of course we need some more "case studies" to decide if it covers all the possible needs. But I'm confident that many requirements would be ridiculously easy to satisfy.

Interface to other plugins

Your idea of the query interface is interesting. I didn't care about so far. The text field is part of my model (but not part of the cache, a fact that users never have to care about). I would only put structured data into the cache. I propose the interface to plugins like that:

  • plugin triggers (like the commonTagsHandler) could be fired while parsing the data
  • plugins use an API to create new tables and put structured data in it.
  • %REPORT could be used to render the plugins data.
  • The plugin can use the API to access the data in the model and provide TWikiVariables to do anything.

about the Cache

I'm not sure that we need a Database. I believe that the structured data could be fully in the servers memory. SQL could be more complicated then just using hashes and arrays.

Then we need a mechanism to find out if the chache is out of date.

  • The cache has a last update time
  • Every topic that definies a dynamical table (form, edit table definitions) is checked first. If it is newer then the cache, the whole table has to be updated.
  • all topics are checked if they are newer then the chache and updated.
  • delete records of topics that dont exist anymore

About the Links

I need links between already joined data tables similar to the FormQueryPlugin. There is no need to filter a cartesian product to the set of right combinations. And it could be really fast. Tell me how to join tables easier! And its used for the nested reports to link inner and outer parts together. Linking of user definied data tables using %LINK{....}% would allow the user to use this feature on user defined tables as well.

-- StefanSteinegger - 23 Mar 2005

Guys, I am extremely delighted that you are looking at this. One shortcoming that TWiki has is the lack of support for generating complex, fast queries against its topics, with the topics interpreted as structured data. And I do agree, the topics should not be viewed as tree structured data, but as a set of relational tables. Or at least, both views should be possible.

FQP is a great start, as Crawford described, but

  • I like the thinking that Stefan is describing, and
  • There seem to be enough deployment issues with FQP that scares people away.

I still wonder why we could not create an efficient representation of a TWiki web in a cache and then use existing relational query capabilities to do the queries (I am not talking about rolling our own SQL data base here, but to have some minimal support for computing joins).

-- ThomasWeigert - 23 Mar 2005

I still think that serialized arrays and hashes are good enough... Don't forget it's only a cache for faster access. The data is already stored in files... The model is a way to think not a way to store data.

By the way: I suggested webs to be a table in the Basic Structure. So there should not be a cache per web.

-- StefanSteinegger - 23 Mar 2005

You may get yourself into a performance issue if you have a cache for all of TWiki. In our installation there are many webs with thousands of topics, and PeterThoeny reports even larger numbers at his installation. Many of these will never be queried by this mechanism.

-- ThomasWeigert - 23 Mar 2005

Thats why I think that only structured data should be in the cache. If you have to read th whole topic text body, you can read it directly from the topic files.

Most topics are composed of the topic name and the revisions (date, revision number and link to the user). If the topic has a form, there is the formname and the formdata (only values, no fieldnames, they are part of the table definition). Then attachments, if there are: name, size, user (link), date, revision, I forgot the comment. Lets say an average of less then 1 kByte per topic. For every thousend topics you'll need less then 1 MB of chache...

It's certainly a good idea if you can put some parameters about data that is used: which webs, which forms, which embedded tables and so on.

-- StefanSteinegger - 23 Mar 2005

I think the key items are:

  • All metadata
  • All tables in text

I am less worried about attachments, but that would be covered by metadata anyway...

-- ThomasWeigert - 23 Mar 2005

Topic attachments
I Attachment History Action Size Date Who Comment
PNGpng BasicStructure.png r1 manage 9.4 K 2005-03-21 - 14:12 StefanSteinegger Basic Structure
PNGpng DynamicStructure.png r1 manage 5.4 K 2005-03-21 - 14:12 StefanSteinegger Dynamic Structure
PNGpng Metadata.png r1 manage 3.0 K 2005-03-21 - 14:13 StefanSteinegger Metadata
Unknown file formatjude ReportPluginIdea.jude r1 manage 12.8 K 2005-03-21 - 14:15 StefanSteinegger UML diagrams
Edit | Attach | Watch | Print version | History: r10 < r9 < r8 < r7 < r6 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r10 - 2005-03-23 - ThomasWeigert
 
  • Learn about TWiki  
  • Download TWiki
This site is powered by the TWiki collaboration platform Powered by Perl Hosted by OICcam.com Ideas, requests, problems regarding TWiki? Send feedback. Ask community in the support forum.
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.