Regular expression libraries for QRegExp

Discussion:

Mark

2011-10-10 09:52:37 UTC

Hi,

Some time ago I read something here about replacing Qt's QRegExp backend
with an existing library to get rid of the maintenance burden.
I can't find the discussion anymore, but I thought I'd post this quite handy
link that compares a dozen regular expression libraries.

http://lh3lh3.users.sourceforge.net/reb.shtml

Note how extremely fast egrep is!
As for the syntax: in my opinion it should stay the same as the current
QRegExp, i.e. the Perl syntax.

My guess is that RE2 should be used, since V8(?) also uses it and it has the
Perl syntax, so it should already be in Qt somewhere since V8 is there.

Good luck,
Mark

Giuseppe D'Angelo

2011-10-10 15:20:07 UTC

Post by Mark
Hi,
Some time ago I read something here about replacing Qt's QRegExp backend
with an existing library to get rid of the maintenance burden.
I can't find the discussion anymore, but I thought I'd post this quite handy
link that compares a dozen regular expression libraries.
http://lh3lh3.users.sourceforge.net/reb.shtml

Good catch!

Post by Mark
Note how extremely fast egrep is!
As for the syntax: in my opinion it should stay the same as the current
QRegExp, i.e. the Perl syntax.

The problem is that QRegExp support is not even close to Perl regexps...

Post by Mark
My guess is that RE2 should be used, since V8(?) also uses it and it has the
Perl syntax, so it should already be in Qt somewhere since V8 is there.
Good luck,
Mark

RE2 does not support positive/negative lookahead/lookbehind, which was
one of the issues to be solved by any proposed engine in Qt5. Even
QRegExp right now supports lookaheads...

BTW, I started to investigate some implementations and wrote some
notes down in my spare time on this wiki page:
http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5
Any help is appreciated.

--
Giuseppe D'Angelo

Olivier Goffart

2011-10-10 15:33:32 UTC

Post by Mark
Hi,
Some time ago I read something here about replacing Qt's QRegExp backend
with an existing library to get rid of the maintenance burden.

Another conclusion was not to touch QRegExp, i.e. to keep compatibility.
Those who need powerful regexps can use the library and syntax they want.
(Note that C++11 has std::regex.)
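
For reference, a minimal sketch of the lookahead assertions discussed earlier in the thread, written against std::regex. It assumes a standard library with a working C++11 <regex>; the helper names and patterns are made up for illustration. The default ECMAScript grammar accepts positive and negative lookahead, (?=...) and (?!...), but it has no lookbehind.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if some "foo" is directly followed by "bar".
// The lookahead (?=bar) checks for "bar" without consuming it.
bool fooBeforeBar(const std::string &text)
{
    static const std::regex re("foo(?=bar)");
    return std::regex_search(text, re);
}

// Hypothetical helper: true if some "foo" is NOT directly followed by "bar".
bool fooNotBeforeBar(const std::string &text)
{
    static const std::regex re("foo(?!bar)");
    return std::regex_search(text, re);
}
```

Note that std::regex works on char or wchar_t sequences, so using it with QString data would still involve a conversion on most platforms.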

Giuseppe D'Angelo

2011-10-10 16:13:47 UTC

Post by Olivier Goffart

Post by Mark
Hi,
Some time ago I read something here about replacing Qt's QRegExp backend
with an existing library to get rid of the maintenance burden.

If a solution doesn't come up in time for 5.0, could QRegExp still be
moved into a separate module so that it's possible to put a
replacement inside qtbase somewhere in the 5.x lifetime?

--
Giuseppe D'Angelo

Thiago Macieira

2011-10-10 16:28:01 UTC

Post by Giuseppe D'Angelo
If a solution doesn't come up in time for 5.0, could QRegExp still be
moved into a separate module so that it's possible to put a
replacement inside qtbase somewhere in the 5.x lifetime?

Unknown. We need to figure out how to do that without breaking QString.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Mark

2011-10-10 17:41:29 UTC

Post by Thiago Macieira

Unknown. We need to figure out how to do that without breaking QString.

Can't there be a "QRegExp2" that is built on the new regex engine?

QRegExp2 should then be used by the stock Qt classes (QString and whatever
else uses it).
QRegExp should be deprecated and dropped in Qt 6..?

I'm also guessing that using std::regex is not an option, since not all
compilers that Qt supports have std::regex support.

Just some brainstorming...

Mark

2011-10-11 08:17:23 UTC

Found something that is probably interesting for Thiago, given his
high-performance blog posts (I like those posts a lot, by the way) :)
I came across this link:

http://blog.phusion.nl/2010/12/06/efficient-substring-searching/
source: https://github.com/FooBarWidget/boyer-moore-horspool

That (Boyer-Moore or Boyer-Moore-Horspool) is probably very interesting for
speeding up string matching anywhere in Qt. If used in RE2 it would probably
speed it up a lot as well. I don't have numbers, nor did I test it. Just
assuming ^_^

Thiago Macieira

2011-10-11 08:34:54 UTC

Post by Mark
Found something that is probably interesting for Thiago, given his
high-performance blog posts (I like those posts a lot, by the way) :)
http://blog.phusion.nl/2010/12/06/efficient-substring-searching/
source: https://github.com/FooBarWidget/boyer-moore-horspool
That (Boyer-Moore or Boyer-Moore-Horspool) is probably very interesting for
speeding up string matching anywhere in Qt. If used in RE2 it would probably
speed it up a lot as well. I don't have numbers, nor did I test it. Just
assuming ^_^

I believe that's the algorithm implemented in QStringMatcher, which
QString::indexOf uses only if the string is much larger than the substring
being searched.
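
To make the algorithm concrete, here is a rough Boyer-Moore-Horspool sketch over UTF-16 code units. It is a generic illustration and not QStringMatcher's actual code: the function name is made up, std::u16string stands in for QString, and indexing the skip table by the low 8 bits of each code unit is an assumption made here to keep the table small.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of needle in haystack,
// or std::u16string::npos if there is none.
std::size_t horspoolFind(const std::u16string &haystack,
                         const std::u16string &needle)
{
    const std::size_t n = haystack.size();
    const std::size_t m = needle.size();
    if (m == 0)
        return 0;
    if (m > n)
        return std::u16string::npos;

    // Skip table indexed by the low byte of a code unit. Collisions only
    // make a skip shorter, never longer, so the search stays correct.
    std::array<std::size_t, 256> skip;
    skip.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[needle[i] & 0xFF] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare the window against the needle from right to left.
        std::size_t j = m;
        while (j > 0 && haystack[pos + j - 1] == needle[j - 1])
            --j;
        if (j == 0)
            return pos;
        // Shift by the skip value of the last code unit in the window.
        pos += skip[haystack[pos + m - 1] & 0xFF];
    }
    return std::u16string::npos;
}
```

Building the table costs a fixed amount of work per search, which is one reason such a matcher only pays off when the haystack is much longer than the needle.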

In the case of QRegExp, I sat down with Lars and João yesterday to discuss a
bit and we talked about QRegExp. We don't know what to do with it because we
want to, at the same time:

* move the current engine out
* use a high-performance engine in QtCore
* not increase the footprint of QtCore by too much
* not restrict the platforms unnecessarily
* avoid code duplication
* avoid converting from UTF-16 to UTF-8 or, worse, local 8 bit

We're not going to get them all, that's for sure. On one hand, the V8 engine
is very performant, works on UTF-16 and avoids code duplication, but it
increases the footprint and restricts the platforms addressed. On the other,
PCRE is performant too, works almost everywhere and is small, but requires
UTF-8←→UTF-16 conversion.

I believe the standard WebKit has a PCRE engine inside, modified to work on
UTF-16. That's also an option, but it is code duplication and causes us to
have to maintain it.

So maybe the solution is a hybrid: dlopen V8 where it is available, fall back
to libpcre otherwise. And crash if none is found. That means using regexps
will cause a library to be loaded, one that can be as big as V8.

What does everyone think?

People used to say:
You had a problem and you used regular expressions. Now you have two.

With Qt 5, that will be three. :-)

(But you deserve it if you're using regular expressions for trivial things)

Sylvain Pointeau

2011-10-11 08:44:53 UTC

Post by Thiago Macieira
We're not going to get them all, that's for sure. On one hand, the V8 engine
is very performant, works on UTF-16 and avoids code duplication, but it
increases the footprint and restricts the platforms addressed. On the other,
PCRE is performant too, works almost everywhere and is small, but requires
UTF-8←→UTF-16 conversion.
I believe the standard WebKit has a PCRE engine inside, modified to work on
UTF-16. That's also an option, but it is code duplication and causes us to
have to maintain it.

PCRE seems to be the better choice: no dependency on V8, and a small
footprint.
Furthermore, you have the possibility to use it on UTF-16.
However, I do not understand why you would have to maintain it, since it is
part of WebKit.

Post by Thiago Macieira
So maybe the solution is a hybrid: dlopen V8 where it is available, fall back
to libpcre otherwise. And crash if none is found. That means using regexps
will cause a library to be loaded, one that can be as big as V8.

I would not do that. I would choose only PCRE and have the same behavior on
all platforms.
Additionally, V8 is such a huge dependency that it should be avoided for
basic building blocks like regexps.

Best regards,
Sylvain

Robin Burchell

2011-10-11 08:49:02 UTC

Hi,

Post by Sylvain Pointeau
PCRE seems to be the better choice: no dependency on V8, and a small
footprint.
Furthermore, you have the possibility to use it on UTF-16.
However, I do not understand why you would have to maintain it, since it is
part of WebKit.

PCRE is a separate library, not part of WebKit. WebKit has a forked
copy of PCRE that doesn't require the UTF-16 -> UTF-8 conversion (at
least, according to
http://lists.macosforge.org/pipermail/webkit-dev/2005-June/000082.html),
but that's not a standalone library, meaning someone would need to rip
it out, make sure they keep it in sync with the copy in WebKit,
etc. Or use the stock PCRE and suffer the performance hit.

Post by Sylvain Pointeau
I would not do that. I would choose only PCRE and have the same behavior on
all platforms.

I do sort of wonder how this is going to work on e.g. Windows - Thiago? :)

Post by Sylvain Pointeau
Additionally, V8 is such a huge dependency that it should be avoided for
basic building blocks like regexps.

I think this could do with some explaining as to what the implications
would actually be, before I'd run off saying it's a bad idea.

thanks,

Robin

Thiago Macieira

2011-10-11 09:07:22 UTC

Post by Robin Burchell

Post by Sylvain Pointeau
I would not do that. I would choose only PCRE and have the same behavior on
all platforms.

I do sort of wonder how this is going to work on e.g. Windows - Thiago?

It would use V8 on Windows.

Sylvain Pointeau

2011-10-11 09:50:34 UTC

Post by Robin Burchell
PCRE is a separate library, not part of WebKit. WebKit has a forked
copy of PCRE that doesn't require the UTF-16 -> UTF-8 conversion (at
least, according to
http://lists.macosforge.org/pipermail/webkit-dev/2005-June/000082.html),
but that's not a standalone library, meaning someone would need to rip
it out, make sure they keep it in sync with the copy in WebKit,
etc. Or use the stock PCRE and suffer the performance hit.

I understand now.
I read that the WebKit team cannot upgrade their PCRE engine because of the
modifications...

What would be the performance impact if we used the UTF-8 PCRE?
Is it that large? (Compared also with the current QRegExp, which I never
liked.)

And what about the ICU regexp?

V8 seems the worst option here: such a dependency for just regexps.

Best regards,
Sylvain

Simon Hausmann

2011-10-11 10:02:50 UTC

Post by Sylvain Pointeau

I understand now.
I read that the WebKit team cannot upgrade their PCRE engine because of the
modifications...

Just to clarify this: WebKit does not use PCRE anymore (and hasn't for at least 2 years, IIRC).

WebKit can use either JavaScriptCore or V8. Both of them come with their own
regular expression engines that build on the respective JS infrastructure.

Simon

Sylvain Pointeau

2011-10-11 14:03:42 UTC

Post by Simon Hausmann
Just to clarify this: WebKit does not use PCRE anymore (and hasn't for at least 2 years, IIRC).
WebKit can use either JavaScriptCore or V8. Both of them come with their own
regular expression engines that build on the respective JS infrastructure.
Simon

Many thanks for the clarification, I was not aware of it.

Thiago Macieira

2011-10-11 10:23:00 UTC

Post by Sylvain Pointeau
and what about the ICU regexp?

The problem with ICU regexp is ICU.

We were discussing also whether we should use ICU or not. What we don't want
is to duplicate the Unicode tables, which are quite big. So the decision has
to be all-in for ICU or all-out.

We're afraid that an all-in for ICU would mean performance regressions in
Unicode-critical code like QString and the text formatting, due to the
overhead of function calls. This is an unconfirmed suspicion, no benchmarking
was done.

João Abecasis

2011-10-11 11:23:54 UTC

Post by Sylvain Pointeau
V8 seems the worst option here, to have such a dependency for just regexp.

The fact is that for a lot of use cases V8 is already there. It's used by Qt Quick and Qt WebKit. A dependency on PCRE, in particular, would be added solely for the sake of QRegExp.

The major concern raised, I think, is applications that want to depend *only* on QtCore. Could we have a QT_NO_V8 #define that forgoes QRegExp altogether? Could we package a PCRE-based solution as a separate add-on for those developers who both want *only* QtCore *and* still require regular expression support?

Anyway, these are all open questions.

Cheers,

João

Thiago Macieira

2011-10-11 11:43:31 UTC

Post by João Abecasis
The major concern raised, I think, is applications that want to depend only
on QtCore. Could we have a QT_NO_V8 #define that forgoes QRegExp
altogether? Could we package a PCRE-based solution as a separate add-on for
those developers who both want only QtCore and still require regular
expression support?

That wouldn't help in standard distributions, but it might for embedded
systems.

For a separate addon, we can provide the existing QRegExp. That should be done
anyway, in fact.

Mark

2011-10-11 08:47:20 UTC

Post by Thiago Macieira
So maybe the solution is a hybrid: dlopen V8 where it is available, fall back
to libpcre otherwise. And crash if none is found. That means using regexps
will cause a library to be loaded, one that can be as big as V8.
What does everyone think?

Doesn't V8 have a "libRE2.so/dll" (or something along those lines) somewhere
that we could just dlopen?
And what exactly is V8 using now? PCRE or RE2?

Note: I'm not using regexps in any project, I'm just interested in this and
keeping the discussion going ^_^

Thiago Macieira

2011-10-11 09:29:21 UTC

Post by Mark
Doesn't V8 have a "libRE2.so/dll" (or something along those lines) somewhere
that we could just dlopen?

No, it's built-in, just like WebKit. If we want to use V8's RE engine, we need
the entire V8.

Post by Mark
And what exactly is V8 using now? PCRE or RE2?

JSRE.

Pau Garcia i Quiles

2011-10-11 09:45:56 UTC

Post by Thiago Macieira
I believe the standard WebKit has a PCRE engine inside, modified to work on
UTF-16. That's also an option, but it is code duplication and causes us to
have to maintain it.

Has anyone tried to submit that to Philip Hazel? If he is interested in
UTF-16 in PCRE (maybe as an optional compiling flag), problem solved.

Looks like a bad idea to me.

Doesn't that mean regular expressions might behave slightly differently
depending on the platform?

That would make debugging difficult.

And that's assuming the libpcre and V8 that would be dlopen'd would be the
ones bundled with QtCore. If they are the system ones (and that's what would
happen in Linux distributions, because distros force packages to use
packaged third-party dependencies), then replace "that makes debugging
difficult" with "that makes debugging crazy": different compilation flags
and/or patches for V8 or PCRE in RHEL might mean it behaves differently than
on Debian, Ubuntu or OS X.

--
Pau Garcia i Quiles
http://www.elpauer.org
(Due to my workload, I may need 10 days to answer)

Stephen Kelly

2011-11-09 14:14:17 UTC

Post by Thiago Macieira
I believe the standard WebKit has a PCRE engine inside, modified to work on
UTF-16. That's also an option, but it is code duplication and causes us to
have to maintain it.

Post by Pau Garcia i Quiles
Has anyone tried to submit that to Philip Hazel? If he is interested in UTF-16
in PCRE (maybe as an optional compiling flag), problem solved.

I have now:

http://bugs.exim.org/show_bug.cgi?id=1049

Jason A. Donenfeld

2011-12-01 09:39:59 UTC

I have started working on the 16 bit: first I added 16 bit library support for
the Unix build system and implemented some utf16 utility functions. Philip and
I agreed that this work should go in a branch to keep the trunk clean,
although not yet decided where/how.

http://bugs.exim.org/show_bug.cgi?id=1049#c33

So if this is delivered, then PCRE for QtCore it is?

_______________________________________________
Qt5-feedback mailing list
http://lists.qt.nokia.com/mailman/listinfo/qt5-feedback

Harri Porten

2011-10-11 10:23:17 UTC

As these two are surely not compatible this does not sound like an
attractive solution.

While I've been fighting with the UTF-8 nature of PCRE myself, I still
liked the library: it's relatively small, extremely portable and independent.
*And* it has an author who is happy to apply enhancements. I bet he wouldn't
have minded Apple's UTF-16 patches either, if done right.

One general thing to stress when talking about the reuse of RegExp engines
meant for JS interpreters: these work differently from e.g. Perl and the
current QRegExp. Maybe only in some corner cases, but it is still a thing to
consider, as it might break code and look odd to non-JS users. Hmmm. Didn't
want to wake up the topic of the other thread again ;)

Harri.

s***@accenture.com

2011-10-11 11:05:59 UTC

Post by Thiago Macieira
So maybe the solution is a hybrid: dlopen V8 where it is available,
fall back to libpcre otherwise. And crash if none is found. That means
using regexps will cause a library to be loaded, one that can be as big
as V8.
What does everyone think?

Loading and relocating a huge library seems too expensive for evaluating a regexp,
even though QML applications have already paid that cost at application startup time.

A UTF-16 API is required for the engine though, as all QStrings are UTF-16.

Mark

2011-10-11 11:16:35 UTC

Post by s***@accenture.com

Post by Thiago Macieira
So maybe the solution is a hybrid: dlopen V8 where it is available,
fall back to libpcre otherwise. And crash if none is found. That means
using regexps will cause a library to be loaded, one that can be as big
as V8.
What does everyone think?

it. Too bad that it also comes with WebKit, but loading that just for regular
expressions is too much IMHO.

Sylvain Pointeau

2011-10-11 14:05:42 UTC

Post by s***@accenture.com
A UTF-16 API is required for the engine though, as all QStrings are UTF-16.

How much does it cost to convert UTF-16 to UTF-8?
Is it really a show-stopper for choosing PCRE?

s***@accenture.com

2011-10-11 14:20:44 UTC

Post by Sylvain Pointeau
How much does it cost to convert UTF-16 to UTF-8?
Is it really a show-stopper for choosing PCRE?

It's not a show stopper.
However, it needs to be included in any benchmarking that compares candidate engines,
i.e. compare the time to evaluate a regexp on a QString.
If the engine API is UTF-16, then QString::constData can be used at no additional cost.
I suspect the conversion cost is mainly in the memory allocation.

Thiago Macieira

2011-10-11 14:21:46 UTC

Post by Sylvain Pointeau

Post by s***@accenture.com
A UTF-16 API is required for the engine though, as all QStrings are UTF-16.

How much does it cost to convert UTF-16 to UTF-8?
Is it really a show-stopper for choosing PCRE?

It costs infinitely more to convert to UTF-8 than to do nothing. The conversion
takes non-zero time and the non-conversion takes no time at all. Division by
zero.

We could optimise the UTF-8 encoder -- in fact, I already have in the QUrl
refactor work because I needed that.

What worries me more is to extract offset information. Suppose the following
code:

int pos = regexp.indexIn(str);

pcre_exec will return an offset of the match start and end for the whole
match, as well as a pair of integers for each capture. Note what the manual
says (pcreapi(3)):

Note: these values are always byte offsets, even in UTF-8 mode. They are not
character counts.

So we need to convert byte offsets in UTF-8 back to UTF-16 code unit offsets. I
can't think of any algorithm that avoids a linear scan: we need to scan forward
and count bytes and QChars.
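
A minimal sketch of that forward scan, in plain C++. The function name is made up, and it assumes the input is valid UTF-8 and that the byte offset falls on a character boundary, as a match offset would.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Maps a byte offset into valid UTF-8 text to the equivalent offset
// counted in UTF-16 code units (QChars). Linear in byteOffset.
std::size_t utf16OffsetFromUtf8Offset(const std::string &utf8,
                                      std::size_t byteOffset)
{
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) == 0x80)
            continue; // continuation byte: already counted with its lead byte
        // A four-byte sequence encodes a code point above U+FFFF, which
        // takes a surrogate pair (two code units) in UTF-16.
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

Each capture offset needs this mapping, although sorted offsets could share one pass over the string.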

João Abecasis

2011-10-11 11:15:06 UTC

Post by Thiago Macieira
In the case of QRegExp, I sat down with Lars and João yesterday to discuss a
bit and we talked about QRegExp. We don't know what to do with it because we
want to, at the same time:
* move the current engine out
* use a high-performance engine in QtCore
* not increase the footprint of QtCore by too much
* not restrict the platforms unnecessarily
* avoid code duplication
* avoid converting from UTF-16 to UTF-8 or, worse, local 8 bit
We're not going to get them all, that's for sure. On one hand, the V8 engine
is very performant, works on UTF-16 and avoids code duplication, but it
increases the footprint and restricts the platforms addressed. On the other,
PCRE is performant too, works almost everywhere and is small, but requires
UTF-8←→UTF-16 conversion.
I believe the standard WebKit has a PCRE engine inside, modified to work on
UTF-16. That's also an option, but it is code duplication and causes us to
have to maintain it.
So maybe the solution is a hybrid: dlopen V8 where it is available, fall back
to libpcre otherwise. And crash if none is found. That means using regexps
will cause a library to be loaded, one that can be as big as V8.
What does everyone think?

I'll explicitly state two points that I think should be kept in mind in this discussion.

As long as QtWebKit, QtScript, QtDeclarative are using V8, using that as the backend for QRegExp's-evolution-or-replacement as well allows us to have a single engine, syntax and implementation across the Qt stack. I think this is a very desirable feature. Of course, it has to be weighed with the other issues at stake.

Secondly, if we decide to use a different RE engine for QRegExp's-evolution-or-replacement, picking different engines according to platform or configuration opens the door to subtle cross-platform differences. It would be best if we can avoid them. Particularly so in Qt Core.

Cheers,

João

Andre Somers

2011-10-11 17:11:02 UTC

Post by João Abecasis

I'll explicitly state two points that I think should be kept in mind in this discussion.
As long as QtWebKit, QtScript, QtDeclarative are using V8, using that as the backend for QRegExp's-evolution-or-replacement as well allows us to have a single engine, syntax and implementation across the Qt stack. I think this is a very desirable feature. Of course, it has to be weighed with the other issues at stake.
Secondly, if we decide to use a different RE engine for QRegExp's-evolution-or-replacement, picking different engines according to platform or configuration opens the door to subtle cross-platform differences. It would be best if we can avoid them. Particularly so in Qt Core.

Would it be conceptually feasible to separate V8's regexp engine from V8
itself, and make V8 link against the separated-out engine? Would such a
change in V8 be accepted upstream? If so, it would open up the
prospect of just using V8's regexp engine and not loading the rest of
it if not needed...

/me is just daydreaming here...

André

Thiago Macieira

2011-10-11 21:39:32 UTC

Post by Andre Somers
Would it be conceptually feasible to separate V8's regexp engine from V8
itself, and make V8 link against the separated-out engine? Would such a
change in V8 be accepted upstream? If so, it would open up the
prospect of just using V8's regexp engine and not loading the rest of
it if not needed...
/me is just daydreaming here...

I'm not sure. In my quick inspection into V8 about a month ago, I didn't pay
attention to the RE code.
However, I understand that V8 automatically compiles the RE straight into
native code, which means the V8 infrastructure and assembler must be present.
At the very least, this is going to be a large duplication of code between
V8-JS and V8-RE.

a***@nokia.com

2011-10-12 13:27:03 UTC

Hi,

While V8 is definitely awesome and sharing its regexp implementation would likewise be awesome, it really isn't feasible.

V8 isn't a collection of independent features, loosely coupled together like Qt is - it is a tightly integrated, highly optimised whole. You cannot simply link the V8 library and then tease out its regexp implementation. The input, output and internal parameters, the code generator and the code cache used by their regexp implementation are all expressed in terms of V8 heap primitives. This not only means that all Qt's input types - like QString - would need to be reallocated inside the V8 heap, it also means that you need to run the V8 garbage collector to clean these resources up. This in turn requires bootstrapping the rest of V8. In short, you are running a JavaScript engine.

Cheers,

Aaron

Post by Thiago Macieira

I'm not sure. In my quick inspection into V8 about a month ago, I didn't pay
attention to the RE code.
However, I understand that V8 automatically compiles the RE straight into
native code, which means the V8 infrastructure and assembler must be present.
At the very least, this is going to be a large duplication of code between
V8-JS and V8-RE.

Sylvain Pointeau

2011-10-12 14:32:55 UTC

Post by a***@nokia.com
Hi,
While V8 is definitely awesome and sharing its regexp implementation would
likewise be awesome, it really isn't feasible.
V8 isn't a collection of independent features, loosely coupled together
like Qt is - it is a tightly integrated, highly optimised whole. You cannot
simply link the V8 library and then tease out its regexp implementation.
The input, output and internal parameters, the code generator and the code
cache used by their regexp implementation are all expressed in terms of V8
heap primitives. This not only means that all Qt's input types - like
QString - would need to be reallocated inside the V8 heap, it also means
that you need to run the V8 garbage collector to clean these resources up.
This in turn requires bootstrapping the rest of V8. In short, you are
running a JavaScript engine.

Compared to this, UTF-16 to UTF-8 conversion seems trivial :-)

I don't see any good alternative to PCRE. Do you have something else in
mind?

Best regards,
Sylvain

a***@nokia.com

2011-10-13 05:27:18 UTC

Hi,

Post by Sivan
Hi Aaron,
I wonder if this monolithic design was needed to give V8 its traits?

Yes. V8 is designed to run JavaScript super fast, at the expense of everything else. It is elegantly structured internally, but it is very clearly a JavaScript engine first.

Post by Sivan
It might be more work, but we could try to "borrow" implementations, à la the open source way.

I don't know what this means.

Cheers,

Aaron

Sylvain Pointeau

2011-10-13 09:52:37 UTC

Could RE2 be a good candidate?

"RE2 supports submatch extraction, but not backreferences.
If you absolutely need backreferences and generalized assertions, then RE2
is not for you"

I find it nice that it uses a DFA, so it guarantees that the execution time is
linear in the length of the input.
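
As a toy illustration of why a DFA gives predictable running time: the matcher below is a hand-built automaton for the made-up pattern a(b|c)*d, anchored at both ends. It is not RE2 code, and the function name is hypothetical. It looks at each input character exactly once and never backtracks, so the work grows linearly with the input length.

```cpp
#include <cassert>
#include <string>

// Hand-built DFA for the pattern a(b|c)*d (whole-string match).
bool matchesAbcStarD(const std::string &input)
{
    enum State { Start, Loop, Accept, Dead };
    State s = Start;
    for (char c : input) {
        switch (s) {
        case Start:
            s = (c == 'a') ? Loop : Dead;
            break;
        case Loop:
            s = (c == 'b' || c == 'c') ? Loop : (c == 'd') ? Accept : Dead;
            break;
        case Accept:
            s = Dead; // nothing may follow the final 'd'
            break;
        case Dead:
            return false; // no transition leaves the dead state
        }
    }
    return s == Accept;
}
```

Backreferences and general lookaround assertions cannot be expressed as a fixed transition table like this, which is the trade-off the quoted RE2 documentation describes.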

Giuseppe D'Angelo

2011-10-13 14:50:28 UTC

I was trying to collect all "technical" comments about the engines on
the page [1] linked by Robin (my fault, I posted that in the wrong
thread), but I won't have a stable Internet connection for another
week...
Either way: RE2 [2] has no lookahead/lookbehind assertions, which was
one of the issues "to be solved" [3] by a proposed engine (and to be
honest, even QRegExp in Qt 4 supports lookaheads, and I don't see any
reason to *drop* features instead of gaining new ones).

[1] http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5
[2] http://code.google.com/p/re2/wiki/Syntax
[3] Cf. http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
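
For readers unfamiliar with the feature under discussion, here is a minimal sketch of positive and negative lookahead. It uses C++11 std::regex with its default ECMAScript grammar purely for illustration; it is not one of the engines being evaluated in this thread, and the function names are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Positive lookahead: return the first run of letters that is directly
// followed by ':'; the colon itself is not part of the match.
std::string keyBeforeColon(const std::string &line)
{
    static const std::regex rx("[A-Za-z]+(?=:)");
    std::smatch m;
    return std::regex_search(line, m, rx) ? m.str(0) : std::string();
}

// Negative lookahead: true if the text contains a 'q' that is not
// followed by 'u'.
bool hasQWithoutU(const std::string &text)
{
    static const std::regex rx("q(?!u)");
    return std::regex_search(text, rx);
}

// Expected behaviour, written as assertions.
void lookaheadExamples()
{
    assert(keyBeforeColon("Warning: overfull hbox") == "Warning");
    assert(keyBeforeColon("no colon here").empty());
    assert(hasQWithoutU("Iraq"));
    assert(!hasQWithoutU("quit"));
}
```

An engine without lookahead would have to match the colon (or the following character) and then trim it from the result by hand.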

Cheers,

--
Giuseppe D'Angelo

Sylvain Pointeau

2011-10-13 15:11:43 UTC

Post by Giuseppe D'Angelo
I was trying to collect all "technical" comments about the engines in
the page [1] linked by Robin (my fault, I posted that in the wrong
thread), but I won't have a stable Internet connection for another
week...

so if I summarize:

RE2:
+ UTF-16
- look ahead

PCRE:
- UTF-8

Is the lookahead definitely a show stopper? If yes, then we only have PCRE.

On the one hand, PCRE is quite the standard and I would be happy to have it.
On the other hand, I understood that UTF-8 <-> UTF-16 + index conversion could be
really annoying for Qt (UTF-16).
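
To make the index problem concrete: an engine working on UTF-8 reports match positions as byte offsets, while QString is indexed in UTF-16 code units, so every reported offset has to be translated. Below is a minimal sketch of that translation, assuming valid UTF-8 input and an offset that falls on a character boundary. The function name is hypothetical and is not part of Qt or PCRE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert a byte offset into a valid UTF-8 string to the corresponding
// offset, counted in UTF-16 code units, of the same text.
std::size_t utf8OffsetToUtf16(const std::string &utf8, std::size_t byteOffset)
{
    assert(byteOffset <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) == 0x80)
            continue;  // continuation byte, already counted via its lead byte
        // A 4-byte sequence encodes a code point above U+FFFF, which needs
        // a surrogate pair (two code units) in UTF-16.
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

The scan is linear in the offset, so translating many match positions in a long string one by one gets expensive unless the conversion is done incrementally.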

Performance vs functionality?

Best regards,
Sylvain

Alex Strickland

2011-10-13 19:11:17 UTC

Post by Sylvain Pointeau
+ UTF-16
- look ahead

On the briefest look I saw no evidence that it supports wide strings, am
I wrong?

--
Regards
Alex

Stefan Majewsky

2011-10-14 18:44:27 UTC

Post by Giuseppe D'Angelo
Either way: RE2 [2] has no lookahead/lookbehind assertions, which was
one of the issues "to be solved" [3] by a proposed engine (and to be
honest even QRegExp in Qt4 supports lookaheads, and I don't see any
reason to *drop* features instead of gaining new ones).

+1

I'd also trade compatibility for speed. I have a LaTeX logfile parser
which uses look(?:ahead|behind) assertions excessively.

Greetings
Stefan

Sylvain Pointeau

2011-10-15 18:40:40 UTC

Post by Stefan Majewsky

+1
I'd also trade compatibility for speed. I have a LaTeX logfile parser
which uses look(?:ahead|behind) assertions excessively.

PCRE is probably then the only option!

Best regards,
Sylvain

l***@nokia.com

2011-10-17 08:27:31 UTC

Post by Stefan Majewsky

+1
I'd also trade compatibility for speed. I have a LaTeX logfile parser
which uses look(?:ahead|behind) assertions excessively.
PCRE is probably then the only option!

Yes, sounds almost like it. The good thing about it is that it also has a
JS compatible mode that we can then offer for people that would like to
have compatible syntax between C++ and QML.

Cheers,
Lars

João Abecasis

2011-10-12 12:36:37 UTC

I would like to avoid doing a lot of work for the benefit of contrived use cases. For users of Qt as a platform there isn't much value in having the RE engine cleanly isolated from everything else. On the other hand, there are probably benefits to having it integrated with the JS engine for those who also use it.

Granted, it's a double-edged sword... or small swiss-army knife.

But then the real question shouldn't be "what if I don't want a JS engine in my application?", but rather "what's the cost?".

Also consider that, as with any project, the chances of getting stuff integrated upstream usually correlate with the value and usefulness of said stuff to the upstream project and its users. In the competitive landscape of JS engines I would expect our bargaining power to be *very* non-existent...

Cheers,

João

André Pönitz

2011-10-12 12:47:16 UTC

Post by João Abecasis

Indeed.

So, do we know what the costs would be?

I, for one, wouldn't mind knowing about the impact on application
debugging, i.e. how many additional symbols (if any...), startup time
when debugging a "normal" application, etc.

Andre'

Thiago Macieira

2011-10-12 14:14:01 UTC

Post by André Pönitz
So, do we know what the costs would be?
I, for one, wouldn't mind to know about the impact on application
debugging. I.e. how many additional symbols (if any...), startup time
when debugging a "normal" application etc.

QtV8 is:

stripped file size = 4838072 = QtCore + QtNetwork + QtPrintSupport

debug symbol count = 42272 =~ QtCore + QtGui + QtNetwork

dynamic symbol count = 446 =~ QtDBus

class count = 583 =~ QtWidgets =~ QtCore + QtGui + QtNetwork + QtSvg
(classes with virtual functions)

read-only size = 0x43720a =~ QtCore + QtNetwork

read-write size = 0x65838 =~ QtCore + QtGui + QtNetwork + QtXmlPatterns
or ~ 0.25 * QtWebKit

relocation count = 47923 =~ QtXmlPatterns + QtWidgets
or ~ 0.25 * QtWebKit

This is a Qt 5 post-refactor build, using gcc 4.6.0 -O3, on x86-64. QtCore
still contains QRegExp and it also contains my new QUrlQuery.

Pau Garcia i Quiles

2011-10-12 14:25:18 UTC

Hi,

Maybe someone has already proposed this, but just in case:

In QtCore, have only QRegExp. Maybe rename it to QSimpleRegExp. Simple
regexp support for console-only or GUI-less applications. Should be enough.

In QtGui, or QtWebKit, or QtDeclarative (I'm not sure which one would be the
best choice), have a new QRegExp using the V8 regexp engine.

qt4to5 would rename all instances of QRegExp to QSimpleRegExp.

--
Pau Garcia i Quiles
http://www.elpauer.org
(Due to my workload, I may need 10 days to answer)

Robin Burchell

2011-10-12 15:08:54 UTC

Post by Pau Garcia i Quiles
In QtCore, have only QRegExp. Maybe rename it to QSimpleRegExp. Simple
regexp support for console-only or GUI-less applications. Should be enough.

this is not just about missing functionality: the current QRegExp is
broken[1], and renaming it won't fix that, nor will it discourage
people from using it.

Apart from the purist's line of thought, it's also very, very slow in
a number of common cases (and really, really slow in some
not-so-common ones). Getting rid of it as much as possible (and
encouraging porting to something less stupid) is something that really
should be done, and now's pretty much the only chance anytime soon to
do that, as leaving it in place implies that it is both good (which it
isn't) and maintained (which it definitely isn't).

thanks,

Robin

[1]: http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5

Pau Garcia i Quiles

2011-10-12 15:17:43 UTC

Post by Robin Burchell

Post by Pau Garcia i Quiles
In QtCore, have only QRegExp. Maybe rename it to QSimpleRegExp. Simple
regexp support for console-only or GUI-less applications. Should be enough.

this is not just about missing functionality: the current QRegExp is
broken[1], and renaming it won't fix that, nor will it discourage
people from using it.

The API can be fixed

Post by Robin Burchell
Apart from the purist's line of thought, it's also very, very slow in
a number of common cases (and really, really slow in some
not-so-common ones).

There is nothing wrong with that.

You want to depend only on QtCore? Fine: no lookbehind for you, and some
cases will be very slow. On the plus side: no dependency on V8, no UTF-16 to
UTF-8 conversion, etc.

You want advanced regexp features and a very performant regexp engine?
You'll need to depend on some other module. This is perfectly fine for GUI
applications. In the end, anyone who does GUI will use the new regexp
engine; anyone who does not do GUI will either use QSimpleRegExp or find an
alternative (PCRE, RE2, etc)

Post by Robin Burchell
Getting rid of it as much as possible (and
encouraging porting to something less stupid) is something that really
should be done, and now's pretty much the only chance anytime soon to
do that, as leaving it in place implies that it is both good (which it
isn't) and maintained (which it definitely isn't).

Unfortunately, it looks like finding a performant and feature-complete
regexp engine without introducing a huge dependency is not that easy. That
huge dependency is unacceptable in QtCore.

--
Pau Garcia i Quiles
http://www.elpauer.org
(Due to my workload, I may need 10 days to answer)

Robin Burchell

2011-10-12 15:26:38 UTC

On Wed, Oct 12, 2011 at 5:17 PM, Pau Garcia i Quiles

Post by Pau Garcia i Quiles

Post by Robin Burchell
this is not just about missing functionality: the current QRegExp is
broken[1], and renaming it won't fix that, nor will it discourage
people from using it.

The API can be fixed

that requires a maintainer, which it does not have

Post by Pau Garcia i Quiles

Post by Robin Burchell
Apart from the purist's line of thought, it's also very, very slow in
a number of common cases (and really, really slow in some
not-so-common ones).

There is nothing wrong with that.

I disagree. Or do you really want to deliberately handicap applications?

Post by Pau Garcia i Quiles
You want advanced regexp features and a very performant regexp engine?
You'll need to depend on some other module. This is perfectly fine for GUI
applications. In the end, anyone who does GUI will use the new regexp
engine; anyone who does not do GUI will either use QSimpleRegExp or find an
alternative (PCRE, RE2, etc)

Nothing stops the adoption of one of those new engines as the backend
for the new regexp-for-Qt, and in fact, I'd encourage that over
keeping the existing implementation, even if that did mean a utf16
conversion - it'd probably still be faster for most cases
(benchmarking needed, obviously), and nothing stops someone adding
utf16 support to that engine, or worst case [in theory] precludes
changing that engine later, should one more suitable pop up.

don't get me wrong here: I'm as sceptical as you are about requiring
V8 or something else for regex, I have written a lot of headless code
using Qt, and I'm not really getting happy warm thoughts about that,
even without actually doing firm measurement on what the impact would
be. But I don't think the status quo is a good one - perhaps because
I've actually had to step in and rip out QRegExp in places where
performance was an issue in order to attain responsive applications,
60fps drawing, etc

thanks,

Robin

Thiago Macieira

2011-10-12 15:39:49 UTC

Post by Pau Garcia i Quiles
You want to depend only on QtCore? Fine: no lookbehind for you, and some
cases will be very slow. On the plus side: no dependency on V8, no UTF-16 to
UTF-8 conversion, etc.

I don't want a class in QtCore that can be qualified as "meh".

Either it's good and people should use it for the vast majority of use-cases
where that technology is applicable, or it has no business being in QtCore.
QtCore is used in 100% of Qt applications (by definition).

And it needs a maintainer.

The only reason to keep it as-is right now is to keep the Qt 4 source
compatibility.

PS: if you find other classes that are "meh" in QtCore, speak up. We need to
see how to improve it and what the impact is. Post in new threads, please.

PPS: QSettings is also "meh" but I've already posted about it.

Thiago Macieira

2011-10-12 15:32:22 UTC

Post by Pau Garcia i Quiles
Hi,
In QtCore, have only QRegExp. Maybe rename it to QSimpleRegExp. Simple
regexp support for console-only or GUI-less applications. Should be enough.
In QtGui, or QtWebKit, or QtDeclarative (I'm not sure which one would be the
best choice), have a new QRegExp using the V8 regexp engine.
qt4to5 would rename all instances of QRegExp to QSimpleRegExp.

That misses one point: we don't want to keep the current RE engine in QtCore.

It's the third largest file (not counting the generated qunicodetables.cpp and
we discussed moving the date-time parser out of qdatetime.cpp) and QRegExp was
barely ported from Qt 3 to Qt 4 in the first place. It wasn't usable from more
than one thread until Qt 4.4, for example. It doesn't implement any known RE
standard. No one has stepped up for maintaining it, much less adding features
people seem to want (forward look-aheads, for example).

Outside of merges, doc changes and generic maintenance, the last change to
qregexp.cpp was March 1st, 2010 (not 2011). There are a total of 8 such
commits in the Qt 4.x repository (that is, since May 2009) and 4 commits in Qt
5.

What's more, it has a flawed API because it mixes the regular expression itself
with the results of the matching. You can't create a QRegExp object of a
complex RE, hoping to share the pre-compiled internals, and use it from
multiple threads. You need to copy the QRegExp object before matching and hope
that it's smart enough to split the matching from the engine.
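
For contrast, here is a minimal sketch of an API shape that keeps the compiled pattern separate from the match results. It uses C++11 std::regex (mentioned elsewhere in this thread) only as an illustration and is not a proposed Qt API; the helper name is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Return the text of the first capture group, or an empty string if the
// pattern does not match. The pattern is only read; all match state lives
// in the local std::smatch, so one compiled pattern can serve many callers.
std::string firstCapture(const std::regex &pattern, const std::string &text)
{
    assert(pattern.mark_count() > 0);  // needs at least one capture group
    std::smatch result;                // per-call match state
    if (std::regex_search(text, result, pattern))
        return result.str(1);
    return std::string();
}
```

Because the function takes the pattern by const reference and never writes to it, a single pre-compiled object can be shared without copying it before every match, which is the property QRegExp lacks.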

And until Qt 4.5, it also had this bad behaviour:

    const QRegExp rx(pattern);
    if (str.indexOf(rx) != -1) {
        /* const rx was modified here!! */
    }

We want to move it completely out of QtCore so that QtCore can get a better
replacement in the future. But we also don't want to break too badly the tons
of source code that must be out there using regexp (whether they should be
using regexps or not).

Girish Ramakrishnan

2011-10-10 16:20:57 UTC

Post by Olivier Goffart

Post by Mark
Hi,
Some time ago I've read something here about replacing Qt't QRegExp backend
with a existing library to get rid of the maintenance burden.

Wouldn't this prevent a nicer tight integration with other Qt classes
like QString, QVariant etc? That is to say it would be nice if we
could use a more powerful regexp with these classes.

Girish

Alexander Neundorf

2011-10-12 18:04:01 UTC

Post by Olivier Goffart

Post by Mark
Hi,
Some time ago I've read something here about replacing Qt't QRegExp
backend with a existing library to get rid of the maintenance burden.

One other conclusion was not to touch QRegExp, meaning keeping
compatibility. And those that need powerful regexps can use the library
and syntax they want. (Notice that in C++11, there is std::regex)

This sounds like a good plan to me.
For all my purposes Qt regexps were powerful and fast enough, and they were
maybe not high-end, but definitely not trivial anymore.
IMO they are good enough as they are, especially keeping the (close to) 100%
source compatibility promise in mind.

Alex

Tomasz Siekierda

2011-10-12 14:47:38 UTC

Renaming QRegExp to QSimpleRegExp seems nice and reasonable, although
it might confuse developers as to which engine they are actually
using. A mild +1 from me :)

This way, QtCore's reg exp:
a) would be as cross-platform as today,
b) would not break QString.

At the same time, a more powerful alternative would sit in another module.

sierdzio

Michael Hasselmann

2011-10-12 15:02:48 UTC

Post by Tomasz Siekierda
Renaming QRegExp to QSimpleRegExp seems nice and reasonable, although
it might confuse developers as to which engine they are actually
using. A mild +1 from me :)

And then developers get the idea that simple = fast, because of less
complexity, right? You'd risk ending up with developers using the wrong
impl for the wrong reason.

Better to leave the name as-is and introduce the faster/more efficient
impl (in the other module) as QFastRegExp.

regards,
Michael

Tomasz Siekierda

2011-10-12 15:24:59 UTC

True, but giving the "powerful" engine a basic name like QRegExp will
send developers a strong message that this is the one that is
preferred and should be used, which will create a slow tidal wave of
migration away from the current implementation, which is what people here
want, at least as far as I understand it.

Anyway, I'm not sure what is best here myself, just thought it good to
back this solution, as it has some visible benefits (OK, drawbacks
too...).

s.
