Did anyone do some research with an alternative engine implementation
(PCRE, ICU, V8, etc.) and can provide some results?
yes, an intern in the trolltech berlin office - in summer 2008.
here is the text ((mostly) verbatim cwiki source code) and a tar
file with some experimental sources.
of course it does not cover all current contenders and the numbers are
probably somewhat outdated.
<noautolink>
---+!! Regular Expressions and Qt
%TOC%
---++ Current Issues with QRegExp
---+++ High Level
* QRegExp API is broken
* see *T7* in [[#Low_Level][Low Level]]
* QRegExp is used for QtScript though it does not fulfill the ECMAScript specification ([[http://www.ecma-international.org/publications/standards/Ecma-262.htm][ECMA-262-1999]]). Missing features include
* Non-greedy quantifiers (see page 141 titled "- 129 -")
* Patternist/XPath also needs Regex features not found in QRegExp, including
* Non-greedy quantifiers ([[http://www.w3.org/TR/xpath-functions/#regex-syntax]])
* Qt Creator might want to offer multi-line Regex search and replace later. This cannot be efficient because of *T6* described below. GtkSourceView has exactly [[http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1][that problem]]...
* Customer ***** complained about QRegExp (though I don't see what their exact problem is):
<blockquote style="margin-left:45px;background-color:#efefef;padding:7px 20px 7px 20px">
In their code they have RegExp for matching emoticons. Unfortunately, they cannot use QRegExp because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
</blockquote>
* Public task requests:
* Lookbehind (*T4*) ([[http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry][bug 217916]])
* Support for POSIX syntax ([[http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry][bug 218604]])
* Removing const modifiers (*T7*) ([[http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry][bug 219234]], [[http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry][bug 209041]])
* Non-greedy quantifiers (*T3*) ([[http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry][bug 116127]])
---+++ Low Level
* *T1*: ^ (caret) and $ (dollar) cannot match at each newline
* *T2*: . (dot) always matches newlines
* *T3*: lazy/non-greedy/reluctant quantifiers are not supported. This is not to be confused with minimal matching, which QRegExp offers only as a switch for the whole pattern (setMinimal).
* *T4*: lookbehind is not supported (lookahead is)
* *T5*: lastIndexIn does not find the last match that a forward search with indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
* *T6*: only linear input is supported; for a text editor like Kate this does not scale
* *T7*: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
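The difference behind *T3* is easiest to see on a concrete input. The following is a minimal sketch using =std::regex= from the C++ standard library (ECMAScript grammar), chosen only because it is self-contained; it is neither QRegExp nor one of the engines evaluated here, and the helper names =firstMatch= and =lazyVersusGreedy= are made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `input`,
// or an empty string if nothing matches.
std::string firstMatch(const std::string &pattern, const std::string &input)
{
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}

// Compares a greedy and a lazy quantifier on the same input.
void lazyVersusGreedy()
{
    const std::string input = "<b>one</b> and <b>two</b>";

    // Greedy: ".*" runs up to the last "</b>".
    assert(firstMatch("<b>.*</b>", input) == "<b>one</b> and <b>two</b>");

    // Lazy: ".*?" stops at the first "</b>".
    assert(firstMatch("<b>.*?</b>", input) == "<b>one</b>");

    // Both kinds mixed in one pattern; a pattern-wide "minimal"
    // switch cannot express this combination.
    assert(firstMatch("<b>.*?</b>.*</b>", input) == input);
}
```

The third pattern is the case that matters for *T3*: the choice between greedy and lazy is made per quantifier, not per pattern.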
---++ Future (or what it could be)
---+++ Engines
---++++ Commonalities of Boost.Regex and PCRE
* (+) *T1*: ^ (caret) and $ (dollar) can be configured to match at each newline or not at runtime
* (+) *T2*: . (dot) can be configured to match newlines or not at runtime
* (+) *T3*: Support for non-greedy quantifiers
* (+) *T4*: Support for lookbehind
---++++ Boost.Regex
* (+) Can match text from non-linear sources like an array of lines
* (+) An integration prototype is working already:
* With matcher-match separation (*T7*)
* On real QStrings or with non-linear input (*T6*)
* With indexIn and proper lastIndexIn implementation (*T5*)
* see [[#Current_status_of_Boost_Regex_in][Current status of Boost.Regex integration]]
* (--) Code size
* [[http://www.boost.org/doc/tools/bcp/bcp.html][bcp]] (which stands for "boost copy") helps to rip out Boost.Regex from a full Boost package
* Current metrics for Boost.Regex and all required Boost components are 565 files, 4.6 MB uncompressed.
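Matching on non-linear sources works because Boost.Regex searches between a pair of bidirectional iterators rather than over a flat character array. The sketch below shows the same idea with =std::regex=, which has the same iterator-based interface; a =std::list<char>= stands in for an editor buffer that is not stored contiguously. It is an illustration only and not the feeder API of the prototype; =Buffer=, =makeBuffer=, =findInBuffer= and =checkBufferSearch= are hypothetical names.

```cpp
#include <cassert>
#include <list>
#include <regex>
#include <string>
#include <vector>

// Non-contiguous character storage standing in for an editor buffer.
using Buffer = std::list<char>;

// Copies the lines into the buffer, appending '\n' after each line.
Buffer makeBuffer(const std::vector<std::string> &lines)
{
    Buffer buf;
    for (const std::string &line : lines) {
        buf.insert(buf.end(), line.begin(), line.end());
        buf.push_back('\n');
    }
    return buf;
}

// Searches the buffer through its iterators and returns the matched
// text, or an empty string if nothing matches.
std::string findInBuffer(const Buffer &buf, const std::string &pattern)
{
    const std::regex re(pattern);
    std::match_results<Buffer::const_iterator> m;
    if (!std::regex_search(buf.begin(), buf.end(), m, re))
        return std::string();
    return m.str(0);
}

// A match that spans the boundary between two lines.
void checkBufferSearch()
{
    std::vector<std::string> lines;
    lines.push_back("first end");
    lines.push_back("begin second");
    assert(findInBuffer(makeBuffer(lines), "end\\s+begin") == "end\nbegin");
}
```

Only the final =m.str(0)= builds a contiguous string, and only for the matched range; the search itself walks the list nodes.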
---++++ PCRE
* (+) Code size: the code already ships with Qt because WebKit uses it
* Upstream PCRE has no UTF-16 support, but the copy in WebKit does. // TODO Bridge to QStrings?
---+++ Strategies
Lists alternative strategy candidates to improve the current situation.
---++++ Core
Strategies integrating a new Regex engine and its implementation in QtCore.
* (--) Increases the code and binary size of QtCore
---+++++ *C1*: Replace current Regexp/Regexp2 engines of QRegExp through another engine in general
QRegExp currently comes with two different regex pattern syntaxes: "RegExp" and "RegExp2". The latter was introduced to fix a problem with the former without breaking compatibility:
"For historical reasons, quantifiers (e.g. '*') that apply to capturing parentheses are more "greedy" than other quantifiers. For example, the pattern 'a*(a)*' will match 'aaa' with cap(1) == 'aaa'." [[http://doc.trolltech.com/4.4/qregexp.html#capturing-text]]
I don't see how this behavior could be imitated by any other Regex library in general. If this is true and we need to keep this feature, we cannot replace QRegExp's "RegExp" engine but only put new engines next to it.
Also, if we decide to replace an existing syntax/engine with a new one, this will have big potential to break customer code (not API, but ABI in a way).
---+++++ *C1.1*: Replace current Regexp/Regexp2 engines of QRegExp through Boost.Regex
* (--) There is at least one small thing that QRegExp can do but Boost.Regex cannot: a back reference in a pattern can go to 10 or higher in QRegExp (e.g. "\\10") but only up to \9 in Boost.Regex (just as in Perl). If this is true, Boost.Regex cannot transparently replace QRegExp's "RegExp2" implementation unless we decide to reduce the number of back references to a maximum of 9.
* See C1
---+++++ *C1.2*: Replace current Regexp/Regexp2 engines of QRegExp through PCRE
* See C1
---+++++ *C2*: Put Boost.Regex as a new engine next to Regexp/Regexp2 in QRegExp
---+++++ *C3*: Put PCRE as a new engine next to Regexp/Regexp2 in QRegExp
---+++++ *C4*: C2 + C3
* (?) Language features too similar?
---++++ Module
Strategies centered around the creation of a new Qt module.
* (+) Does *not* increase the code or binary size of QtCore
---+++++ *M1*: Boost.Regex as a new module QtRegex
---+++++ *M2*: PCRE as a new module QtRegex
---+++++ *M3*: Boost.Regex *and* PCRE as a new multi-engine module QtRegex
* (?) Language features too similar?
---+++++ *M[4..6]*: M[1..3] + Loosely integrate new Regex API into QtCore (QString, ..[?]) through abstract classes
Could look like this:
* class QString
* bool contains(const QRegExp & rx) const;
* bool contains(const *QAbstractRegexEngine* &matcher) const;
* ..
* int indexOf(const QRegExp & rx, int from = 0) const;
* int indexOf(const *QAbstractRegexEngine* &matcher, int from = 0) const;
* ..
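To make the loose coupling concrete, here is a minimal sketch of the idea in plain C++. It is an assumption-laden illustration, not proposed Qt API: =std::string= stands in for QString, =std::regex= stands in for the real backend (Boost.Regex or PCRE), and the class names, the =indexIn= member and the free functions are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical interface that the string class would depend on.
class AbstractRegexEngine
{
public:
    virtual ~AbstractRegexEngine() = default;
    // Returns the offset of the first match at or after `from`, or -1.
    virtual int indexIn(const std::string &input, int from) const = 0;
};

// One possible backend, living outside the core library.
class StdRegexEngine : public AbstractRegexEngine
{
public:
    explicit StdRegexEngine(const std::string &pattern) : m_re(pattern) {}

    int indexIn(const std::string &input, int from) const override
    {
        if (from < 0 || static_cast<std::size_t>(from) > input.size())
            return -1;
        // Tell the engine that text exists before the start position,
        // so "^" and "\b" do not treat `from` as the start of the input.
        const std::regex_constants::match_flag_type flags =
            from > 0 ? std::regex_constants::match_prev_avail
                     : std::regex_constants::match_default;
        std::smatch m;
        if (!std::regex_search(input.begin() + from, input.end(), m, m_re, flags))
            return -1;
        return from + static_cast<int>(m.position(0));
    }

private:
    std::regex m_re;
};

// Stand-ins for the proposed QString overloads.
int indexOf(const std::string &s, const AbstractRegexEngine &matcher, int from = 0)
{
    return matcher.indexIn(s, from);
}

bool contains(const std::string &s, const AbstractRegexEngine &matcher)
{
    return indexOf(s, matcher) != -1;
}

// Small self-check of the sketch.
void checkSketch()
{
    const StdRegexEngine digits("[0-9]+");
    assert(contains("abc 123", digits));
    assert(indexOf("abc 123", digits) == 4);
    assert(indexOf("12 ab 34", digits, 2) == 6);
}
```

The point of the design is that the string class only sees the abstract interface, so the engine and its code size stay in the separate module.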
---++++ Labs
---+++++ *L1*: Publish Boost.Regex integration as a project on Trolltech Labs
* (+) We might get detailed feedback from customers
* (+) Customers get new (alpha or beta) Regex code earlier
* (+) We can still integrate the same code later
---++ Current status of Boost.Regex integration
---+++ Concepts
* Stay close to QRegExp where it doesn't hurt to make people feel at home
* Separate matcher and match object
* One match/matcher pair for plain strings, another pair for non-linear input (from a "feeder")
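The matcher/match separation can be sketched with =std::regex=, which follows the same split: the compiled pattern is one object and every result is a separate one. This is an analogy only and not the prototype's classes; =firstCaptures= and =oneEngineTwoMatches= are hypothetical helper names.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects capture group 1 of every match. The engine is passed by
// const reference and is not modified by matching.
std::vector<std::string> firstCaptures(const std::regex &engine,
                                       const std::string &input)
{
    std::vector<std::string> out;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), engine); it != end; ++it)
        out.push_back(it->str(1));
    return out;
}

// One const engine, two independent match objects alive at once (1:n).
void oneEngineTwoMatches()
{
    const std::regex engine("(\\w+)=(\\d+)");
    const std::string a = "width=80";
    const std::string b = "height=24";

    std::smatch ma;
    std::smatch mb;
    const bool okA = std::regex_match(a, ma, engine);
    const bool okB = std::regex_match(b, mb, engine);

    assert(okA && okB);
    assert(ma.str(1) == "width");  // still valid after the second match
    assert(mb.str(2) == "24");
}
```

With a combined matcher/match object, the second match would overwrite the captures of the first, which is the situation described in *T7*.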
---+++ Known todos
* try to get templates out of the feeder API in a smooth way
* integrate x-modifier (already supported by Boost.Regex)
* fix coding style violations
* d pointers
---+++ Open questions
* Better class names? @YOU: Ideas?
---+++ Binary size
.. of what currently is libQtRegex:
| *Linux 32Bit* |||
| *Symbols* | *Linking* | *Size* |
| Debug | Dynamic | 4,069,988 |
| Debug | Static | 4,280,300 |
| Release | Dynamic | 492,830 |
| Release | Static | 449,410 |
| Release/stripped | Static | 333,548 |
| *Windows 32Bit* |||
| *Symbols* | *Linking* | *Size* |
| Debug | Dynamic | 4,290,586 |
| Debug | Static | 4,845,780 |
| Release | Dynamic | 412,672 |
| Release | Static | 437,760 |
| Release/stripped | Static | 437,760 |
Legend:
* Dynamic - linked as shared library
* Static - linked as executable with demo code
* Size is in bytes
These numbers are
* for the Boost.Regex code and its Qt integration together
* still moving up and down a little ...
---+++ Public API
(without constructors)
* class *QRegexEngineBase*
* enum *AnchorMode*
* AnchorWontMatch
* AnchorAtStartEnd
* AnchorAtEachLine
* Qt::CaseSensitivity *caseSensitivity* () const;
* bool *isValid* () const;
* int *numCaptures* () const;
* QString *pattern* () const;
* void *setCaseSensitivity* (Qt::CaseSensitivity cs);
* void *setPattern* (const QString & pattern);
* class *QRegexEngine* : QRegexEngineBase
* QRegexMatch *exactMatch* (const QString & input, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QRegexMatch *findFirst* (const QString & input, int startOffset = 0, int endOffset = -1, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QRegexMatch *findLast* (const QString & input, int startOffset = 0, int endOffset = -1, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QList<QRegexMatch> *findAll* (const QString & input, int startOffset = 0, int endOffset = -1, AnchorMode anchorMode = AnchorAtStartEnd) const;
* class *QRegexFeedEngine* <feeder_type> : QRegexEngineBase
* QRegexFeedMatch<feeder_type> *exactMatch* (feeder_type begin, feeder_type end, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QRegexFeedMatch<feeder_type> *findFirst* (feeder_type begin, feeder_type end, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QRegexFeedMatch<feeder_type> *findLast* (feeder_type begin, feeder_type end, AnchorMode anchorMode = AnchorAtStartEnd) const;
* QList<QRegexFeedMatch<feeder_type> > *findAll* (feeder_type begin, feeder_type end, AnchorMode anchorMode = AnchorAtStartEnd) const;
* class *QRegexMatchBase* <T> // <T> in {<const ushort *>, <feeder_type>}
* virtual QString *cap* (int nth = 0) const = 0;
* QStringList *capturedTexts* () const;
* int *count* () const;
* bool *isValid* () const;
* int *length* (int nth = 0) const;
* int *pos* (int nth = 0) const;
* class *QRegexMatch* : QRegexMatchBase <const ushort *>
* QString *cap* (int nth = 0) const;
* class *QRegexFeedMatch* <feeder_type> : QRegexMatchBase<feeder_type>
* QString *cap* (int nth) const;
---+++ Usage example
<code><pre>
#include <QtDebug>
#include <QString>
#include <QStringList>
#include <qregexengine.h>
#include <qregexmatch.h>
#include <qregexfeedengine.h>
#include <qregexfeedmatch.h>
#include <stringlistregexfeeder.h>

int main()
{
    // Input as a list of lines; the feeder presents it as one text
    QStringList list;
    list << QString("hello");
    list << QString("happy");
    list << QString("world");
    StringListRegexFeeder feeder(list);

    // Plain string: match against the joined text
    QRegexEngine engineOne("(h[a-z]+){2}");
    QRegexMatch matchOne = engineOne.exactMatch(list.join(""));
    for (int j = 0; j < matchOne.count(); j++) {
        qDebug() << "[" << j << "]" << matchOne.cap(j);
    }
    qDebug() << "";

    // Complex feeder: match on the list directly, without joining it
    QRegexFeedEngine<StringListRegexFeeder> engineTwo("(h[a-z]+){2}");
    QRegexFeedMatch<StringListRegexFeeder> matchTwo = engineTwo.exactMatch(feeder.begin(), feeder.end());
    for (int i = 0; i < matchTwo.count(); i++) {
        qDebug() << "[" << i << "]" << matchTwo.cap(i);
    }
    qDebug() << "";

    return 0;
}
</pre></code>
</noautolink>