Thoughts on Solr as a catalog replacement and the catalog in general

Hanno Schlichting-4

Thoughts on Solr as a catalog replacement and the catalog in general

Hi there,

I know various people have been looking into using Solr as a catalog
replacement. After spending some considerable time on this for a
project, I have some observations. Maybe others have had other
experiences they'd like to share.

Note that small sites generally won't run into this. Sites with up to
a few tens of thousands of documents usually work fine. This really
applies to sites with 100,000+ documents (objects in the catalog). For
any non-trivial site I'd strongly suggest using
experimental.catalogqueryplan. We couldn't run our large sites without
it, and it makes the catalog a much less urgent bottleneck.

The good news:

With collective.solr it is pretty easy to use Solr as a backend for
full text search queries. Replacing "SearchableText" queries is almost
always a win, as the ZCTextIndex implementation is embarrassingly
simple and ineffective at the same time. Additional features like
facets are a nice bonus. There's small issues being worked on all the
time and things like efficient batching being implemented, but in
general it works pretty well.
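
To make the "transparent" part concrete, a minimal sketch of what this
looks like in calling code (hedged - `context` stands for any content
object, the query values are invented; the dispatching to Solr is
collective.solr's doing):

  from Products.CMFCore.utils import getToolByName

  catalog = getToolByName(context, 'portal_catalog')
  # With collective.solr installed and activated, a query containing
  # SearchableText is answered by Solr instead of ZCTextIndex, while
  # the calling code stays exactly the same:
  results = catalog(SearchableText='annual report',
                    portal_type='Document')
  for brain in results[:5]:
      print brain.getPath(), brain.Title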

More generally following the Solr community I think they already have
a fantastic product, a thriving community and lots of good things
planned for the next releases. They merged development with their base
Lucene library, have good plans for a 2.0 release and much more.
There's a number of CMS systems which have opted for Solr and I think
Plone should do this in the mid-term as well. Search is just too
complicated a problem to solve on our own. What we offer is search
technology from 10 years back - the world has moved on.

The bad news:

Solr is tailored to getting content updates in batches and doing
"commits" to the base index infrequently. They currently suggest
expecting a delay of at least a minute before updates to the index are
available in the search results. There's a whole lot of tweaking you
can do to caches, and there are improvements planned for "near
realtime search" (http://wiki.apache.org/solr/NearRealtimeSearch). But
generally Solr won't ever be able to guarantee that updated data shows
up in search results within a given time. The way to get better search
response times is generally to do fewer content updates and live with
longer delays. For full text searches this is a sensible strategy.

The catalog on the other hand is integrated into the rest of the
database transactions and you can expect to do a change to the catalog
and see those results reflected in a catalog search in the same
transaction. The collective.indexing project tried to improve this and
move all catalog indexing to the end of the transaction to avoid any
duplicate operations. Unfortunately it had to bail out of most
optimizations, as there's too much code that relies on these
in-transaction updates.
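
To spell out the guarantee being relied upon, a tiny sketch (assuming
`context` is some content object and a "Title" index exists):

  from Products.CMFCore.utils import getToolByName

  catalog = getToolByName(context, 'portal_catalog')
  context.setTitle('Renamed mid-transaction')
  context.reindexObject(idxs=['Title'])
  # Still inside the same, uncommitted transaction - and the change is
  # already visible to queries. This is the guarantee Solr cannot give:
  brains = catalog(Title='Renamed mid-transaction')
  assert len(brains) > 0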

We tried dispatching all queries to Solr instead of to the catalog in
one project. This was only for ZEO clients serving anonymous traffic,
so there would be no confusion about edit operations not taking effect
immediately. Even with all Solr queries taking 2ms on average,
rendering Plone pages as a whole has seen an increase in time of 100%
and more. With Solr every catalog query suddenly has a small network
delay and an overhead for parsing XML responses into Python data.
Currently there's no cache on the ZEO client side for these results,
so you have to transport all data over and over again.

My conclusion for the moment is that using Solr as a transparent
replacement for the entire catalog is not possible. I think what we
need to do first is to change Plone code and its use of the catalog.
I had some naive hope of being able to avoid this and getting a magic
silver bullet ;)

1. There's a number of catalog searches that can always live with a
short delay. Examples are some portlets ("latest five news", "recent
events", "review queue") - where a potential delay of a minute will be
acceptable. Other portlets like the navigation tree generally need to
be updated immediately. I think we need to add markers to the code,
which signals this to the underlying layers. Say add "async=True" as
an argument to the catalog query as a simple first step. The
underlying layer can then decide which search backend to use and for
"async" searches rely on Solr.

2. Add information about the "columns" that are used from a specific
query to the catalog call. This is the equivalent of saying "select a,
b from foo" vs. "select * from foo". Solr has a similar mechanism,
which allows specifying the data one expects on the returned flares
(brains). This can limit the amount of data that needs to be
transported in significant ways. In one project the full-fledged
metadata of each object is about 1kb of XML - which quickly grows out
of hand: the multiple queries on each page view add up to 10-20kb of
XML to transport and parse. If all you need is "id", "title" and
"path" you shouldn't have to transport and parse the rest.

3. Carefully review all places we do catalog queries and consider
replacing them with non-catalog solutions. I know some people started
working on this and I think we need to do it. With Plone 4, blob
support and the new plone.folder package there's not many reasons
anymore to call the catalog all the time. Filtering for security
restrictions is the one thing we need to find a convenient and
performant API for. But generally all operations that work "on the
current or one specific context / folder" should be done without the
catalog.
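
For the single-folder case this can be as simple as the following
sketch - direct container access plus an explicit security check
standing in for the convenient filtering API we still need to find:

  from AccessControl import getSecurityManager

  def folder_listing(folder):
      """List one folder's contents without touching portal_catalog."""
      sm = getSecurityManager()
      for obj in folder.objectValues():   # direct container access
          if sm.checkPermission('View', obj):
              yield obj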

4. Encourage any experiments with using a SQL database as a catalog
backend. I have heard people talk about it, but never seen anything
concrete. An SQL database can be integrated into the same transaction
as the ZODB and deliver a strong guarantee on consistency. This could
be used for the kinds of queries where this strong guarantee is
required. The main trick for this is, that an SQL database can do
queries on the server side. With ZEO we currently transport the entire
catalog of potentially some hundreds of mb to each application
instance, do a query on it and use a couple bytes at the end. This
scales to a couple of ZEO clients, but we are running sites which have
10 or 20 ZEO clients. Doing all calculations client side is just not a
good model there. The ZODB has no query language that can be used to
do server side queries.
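
Since nothing concrete exists yet, here is only a sketch of the shape
such an experiment could take - a relational "catalog" table joined
into the Zope transaction via SQLAlchemy and zope.sqlalchemy (the
table and column names are invented):

  from sqlalchemy import create_engine, text
  from sqlalchemy.orm import scoped_session, sessionmaker
  from zope.sqlalchemy import ZopeTransactionExtension

  engine = create_engine('postgresql:///plone_catalog')
  Session = scoped_session(sessionmaker(
      bind=engine, extension=ZopeTransactionExtension()))

  # The query runs on the database server; only the five matching rows
  # travel over the wire, instead of the whole catalog state:
  rows = Session.execute(
      text("SELECT path, title FROM catalog "
           "WHERE portal_type = :pt ORDER BY modified DESC LIMIT 5"),
      {'pt': 'News Item'}).fetchall()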

5. Push forward the __parent__ pointer move. This will help with those
queries that need to do a brain.getObject() today or where adding
metadata to the catalog would slow it down in general. In these cases
the catalog (or external backend) could simply return the list of
poids of the objects as a result. Loading a single object from the
ZODB is extremely fast and harnesses the ZODB cache. Currently we need
to traverse to the object from the root based on its path, so it gets
a proper Acquisition context to work with. The __parent__ pointer
changes avoid these and you get a fully functional object from a
"connection.get(poid)" call without any more wrapping or traversing.

6. Consider splitting out a catalog used for navigation from the
general catalog. The navigation tree and sitemap rely on indexing the
path and doing queries on that. But they generally need little
metadata to be indexed. If we could avoid indexing the path in the
normal catalog, moving an object inside the site hierarchy wouldn't
invalidate the larger catalog anymore. This relies on getting a good
UID story to use as a unique key in the catalog. As long as it uses
"getPhysicalPath" internally as its unique key, there's no real
advantage.
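
A sketch of how such a split could look from calling code - note that
"portal_navcatalog" is hypothetical, naming the proposed slim
navigation catalog keyed by UID rather than by path:

  from Products.CMFCore.utils import getToolByName

  navcat = getToolByName(context, 'portal_navcatalog')  # proposed tool
  brains = navcat(path={'query': '/plone/about', 'depth': 1})
  # The stable UID, not getPhysicalPath, is the key - so moving content
  # would only reindex this small catalog, not the big one:
  tree = [(brain.UID, brain.Title) for brain in brains]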


That's it for now. If you have any comments, they are much
appreciated. I hope we can shape some of this into more concrete
technical proposals and work on these issues. Generally we need to do
these things if we want to be able to efficiently serve sites in the
category of multiple hundreds of thousands of documents. For medium
and small sites this should bring improvements as well. But a lot of
this doesn't apply as long as you run with just one ZEO client that
has all the data anyway.

Sorry for the long mail :)
Hanno

Laurence Rowe

Re: Thoughts on Solr as a catalog replacement and the catalog in general

Hanno Schlichting-4 wrote
Hi there,

I know various people have been looking into using Solr as a catalog
replacement. After spending some considerable time on this for a
project, I have some observations. Maybe others have had other
experiences they'd like to share.

Note that small sites generally won't run into this. Sites with up to
a few tens of thousands of documents usually work fine. This really
applies to sites with 100,000+ documents (objects in the catalog). For
any non-trivial site I'd strongly suggest using
experimental.catalogqueryplan. We couldn't run our large sites without
it, and it makes the catalog a much less urgent bottleneck.

The good news:

With collective.solr it is pretty easy to use Solr as a backend for
full text search queries. Replacing "SearchableText" queries is almost
always a win, as the ZCTextIndex implementation is embarrassingly
simple and ineffective at the same time. Additional features like
facets are a nice bonus. There's small issues being worked on all the
time and things like efficient batching being implemented, but in
general it works pretty well.

More generally following the Solr community I think they already have
a fantastic product, a thriving community and lots of good things
planned for the next releases. They merged development with their base
Lucene library, have good plans for a 2.0 release and much more.
There's a number of CMS systems which have opted for Solr and I think
Plone should do this in the mid-term as well. Search is just too
complicated a problem to solve on our own. What we offer is search
technology from 10 years back - the world has moved on.
This is a big dependency for small sites - you need to configure a JVM and an extra daemon process. I'd be reluctant to just remove the, albeit limited, inbuilt text search capabilities of Plone.

Hanno Schlichting-4 wrote
The bad news:

Solr is tailored to getting content updates in batches and doing
"commits" to the base index infrequently. They currently suggest
expecting a delay of at least a minute before updates to the index are
available in the search results. There's a whole lot of tweaking you
can do to caches, and there are improvements planned for "near
realtime search" (http://wiki.apache.org/solr/NearRealtimeSearch). But
generally Solr won't ever be able to guarantee that updated data shows
up in search results within a given time. The way to get better search
response times is generally to do fewer content updates and live with
longer delays. For full text searches this is a sensible strategy.

The catalog on the other hand is integrated into the rest of the
database transactions and you can expect to do a change to the catalog
and see those results reflected in a catalog search in the same
transaction. The collective.indexing project tried to improve this and
move all catalog indexing to the end of the transaction to avoid any
duplicate operations. Unfortunately it had to bail out of most
optimizations, as there's too much code that relies on these
in-transaction updates.
I suspect that most of this in-transaction usage could be eliminated by simply changing page edits to return a redirect rather than render an entire page. This would significantly reduce the time spent inside the transaction. I think the likelihood of a conflict error should be proportional to the transaction time multiplied by the number of (writing) threads.

You could even use X-Accel-Redirect type tricks to avoid the extra roundtrip to the client.
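
A minimal sketch of the redirect idea (the handler shape is
illustrative, not any particular Plone view):

  def handle_edit(context, request):
      context.setTitle(request.form['title'])
      context.reindexObject()
      # Answer with a redirect instead of rendering the whole page while
      # the write transaction is still open; the follow-up GET renders:
      request.response.redirect(context.absolute_url())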

Hanno Schlichting-4 wrote
We tried dispatching all queries to Solr instead of to the catalog in
one project. This was only for ZEO clients serving anonymous traffic,
so there would be no confusion about edit operations not taking effect
immediately. Even with all Solr queries taking 2ms on average,
rendering Plone pages as a whole has seen an increase in time of 100%
and more. With Solr every catalog query suddenly has a small network
delay and an overhead for parsing XML responses into Python data.
Currently there's no cache on the ZEO client side for these results,
so you have to transport all data over and over again.
Do you have an idea of how many catalogue queries are required to render an average page?

Hanno Schlichting-4 wrote
My conclusion for the moment is that using Solr as a transparent
replacement for the entire catalog is not possible. I think what we
need to do first is to change Plone code and its use of the catalog.
I had some naive hope of being able to avoid this and getting a magic
silver bullet ;)

1. There's a number of catalog searches that can always live with a
short delay. Examples are some portlets ("latest five news", "recent
events", "review queue") - where a potential delay of a minute will be
acceptable. Other portlets like the navigation tree generally need to
be updated immediately. I think we need to add markers to the code,
which signals this to the underlying layers. Say add "async=True" as
an argument to the catalog query as a simple first step. The
underlying layer can then decide which search backend to use and for
"async" searches rely on Solr.
Those portlets are prime candidates for replacing with ESI / AJAX includes. If groups and roles could be encoded in the authentication cookie – now possible with the mod_auth_tkt format in Plone 4 – they could be generated as a simple transform of the Solr XML without any need for tying up a Zope thread.

Hanno Schlichting-4 wrote
2. Add information about the "columns" that are used from a specific
query to the catalog call. This is the equivalent of saying "select a,
b from foo" vs. "select * from foo". Solr has a similar mechanism,
which allows specifying the data one expects on the returned flares
(brains). This can limit the amount of data that needs to be
transported in significant ways. In one project the full-fledged
metadata of each object is about 1kb of XML - which quickly grows out
of hand: the multiple queries on each page view add up to 10-20kb of
XML to transport and parse. If all you need is "id", "title" and
"path" you shouldn't have to transport and parse the rest.

3. Carefully review all places we do catalog queries and consider
replacing them with non-catalog solutions. I know some people started
working on this and I think we need to do it. With Plone 4, blob
support and the new plone.folder package there's not many reasons
anymore to call the catalog all the time. Filtering for security
restrictions is the one thing we need to find a convenient and
performant API for. But generally all operations that work "on the
current or one specific context / folder" should be done without the
catalog.
Filtering for security restrictions is the key requirement, but that said, for the most part subfolders have the same permissions as their parents - filtering out a few forbidden entries is not too expensive.

Hanno Schlichting-4 wrote
4. Encourage any experiments with using a SQL database as a catalog
backend. I have heard people talk about it, but never seen anything
concrete. An SQL database can be integrated into the same transaction
as the ZODB and deliver a strong guarantee on consistency. This could
be used for the kinds of queries where this strong guarantee is
required. The main trick for this is, that an SQL database can do
queries on the server side. With ZEO we currently transport the entire
catalog of potentially some hundreds of mb to each application
instance, do a query on it and use a couple bytes at the end. This
scales to a couple of ZEO clients, but we are running sites which have
10 or 20 ZEO clients. Doing all calculations client side is just not a
good model there. The ZODB has no query language that can be used to
do server side queries.
Querying a SQL database will incur the same trade off as querying Solr. Replicating the catalogue (whether SQL or Solr) onto each client machine would eliminate the network latency, and would only require one copy of the catalogue per machine, rather than per Zope thread (you normally run 4 single threaded Zope processes per box). This also has the advantage of being 'push' replication rather than the 'pull' replication of ZEO, which means the data is there when you first need it.

Hanno Schlichting-4 wrote
5. Push forward the __parent__ pointer move. This will help with those
queries that need to do a brain.getObject() today or where adding
metadata to the catalog would slow it down in general. In these cases
the catalog (or external backend) could simply return the list of
poids of the objects as a result. Loading a single object from the
ZODB is extremely fast and harnesses the ZODB cache. Currently we need
to traverse to the object from the root based on its path, so it gets
a proper Acquisition context to work with. The __parent__ pointer
changes avoid these and you get a fully functional object from a
"connection.get(poid)" call without any more wrapping or traversing.
While traversing up from an object is more efficient than traversing down from an object (no intermediate B-Tree lookups required), to render an object's URL or calculate security permissions you still need to traverse its parents. However, if you already know those from the catalogue then it could be a big saving.

Hanno Schlichting-4 wrote
6. Consider splitting out a catalog used for navigation from the
general catalog. The navigation tree and sitemap rely on indexing the
path and doing queries on that. But they generally need little
metadata to be indexed. If we could avoid indexing the path in the
normal catalog, moving an object inside the site hierarchy wouldn't
invalidate the larger catalog anymore. This relies on getting a good
UID story to use as a unique key in the catalog. As long as it uses
"getPhysicalPath" internally as its unique key, there's no real
advantage.
I think this is the real saving from the __parent__ pointer business. When you can access an object cheaply there is no need to keep so much metadata about it - perhaps not any at all, other than path and permissions - the sort of stuff you might store in a zc.relationship type catalogue.

Hanno Schlichting-4 wrote
That's it for now. If you have any comments, they are much
appreciated. I hope we can shape some of this into more concrete
technical proposals and work on these issues. Generally we need to do
these things if we want to be able to efficiently serve sites in the
category of multiple hundreds of thousands of documents. For medium
and small sites this should bring improvements as well. But a lot of
this doesn't apply as long as you run with just one ZEO client that
has all the data anyway.

Sorry for the long mail :)
Thanks for the detailed analysis!

Laurence
Martin Aspeli

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Hanno Schlichting-4
Hi Hanno,

On 9 May 2010 06:12, Hanno Schlichting <[hidden email]> wrote:

> Hi there,
>
> I know various people have been looking into using Solr as a catalog
> replacement. After spending some considerable time on this for a
> project, I have some observations. Maybe others have had other
> experiences they'd like to share.
>
> Note that small sites generally won't run into this. Sites with up to
> a few tens of thousands of documents usually work fine. This really
> applies to sites with 100,000+ documents (objects in the
> catalog).

I think this is an important point. As a dependency, Java + Solr is
troublesome for small deployments, and possibly even for development
environments on projects that may one day grow "big".

> For any non-trivial site I'd strongly suggest using
> experimental.catalogqueryplan. We couldn't run our large sites without
> it, and it makes the catalog a much less urgent bottleneck.

If that's true, I think we need to get this out of experimental.* and
into Zope/CMF/Plone.

> The good news:
>
> With collective.solr it is pretty easy to use Solr as a backend for
> full text search queries. Replacing "SearchableText" queries is almost
> always a win, as the ZCTextIndex implementation is embarrassingly
> simple and ineffective at the same time. Additional features like
> facets are a nice bonus. There's small issues being worked on all the
> time and things like efficient batching being implemented, but in
> general it works pretty well.
>
> More generally following the Solr community I think they already have
> a fantastic product, a thriving community and lots of good things
> planned for the next releases. They merged development with their base
> Lucene library, have good plans for a 2.0 release and much more.
> There's a number of CMS systems which have opted for Solr and I think
> Plone should do this in the mid-term as well. Search is just too
> complicated a problem to solve on our own. What we offer is search
> technology from 10 years back - the world has moved on.

At least I think we should make it a supported option. Lots of sites
don't really need search, or only need trivial search. Lucene/Solr is
always going to be another long-running process and more overhead to
set up and maintain. I don't think it'll ever be the right choice for
*everyone*, unless we port Plone to Java. ;-)

> The bad news:
>
> Solr is tailored to getting content updates in batches and doing
> "commits" to the base index infrequently. They currently suggest
> expecting a delay of at least a minute before updates to the index are
> available in the search results. There's a whole lot of tweaking you
> can do to caches, and there are improvements planned for "near
> realtime search" (http://wiki.apache.org/solr/NearRealtimeSearch). But
> generally Solr won't ever be able to guarantee that updated data shows
> up in search results within a given time. The way to get better search
> response times is generally to do fewer content updates and live with
> longer delays. For full text searches this is a sensible strategy.
>
> The catalog on the other hand is integrated into the rest of the
> database transactions and you can expect to do a change to the catalog
> and see those results reflected in a catalog search in the same
> transaction. The collective.indexing project tried to improve this and
> move all catalog indexing to the end of the transaction to avoid any
> duplicate operations. Unfortunately it had to bail out of most
> optimizations, as there's too much code that relies on these
> in-transaction updates.

I don't think this is quite so much of an issue for the full-text
index scenario. It's an issue because we also use the catalog for
navigation and other UI elements. If your page doesn't show up in the
navtree when you've added it, you'll get upset. If it takes a minute
for it to appear in a full text search from the "search" box, not so
much.

> We tried dispatching all queries to Solr instead of to the catalog in
> one project. This was only for ZEO clients serving anonymous traffic,
> so there would be no confusion about edit operations not taking effect
> immediately. Even with all Solr queries taking 2ms on average,
> rendering Plone pages as a whole has seen an increase in time of 100%
> and more.

You mean "rendering time doubled (and requests-per-second halved) with Solr"?

> With Solr every catalog query suddenly has a small network
> delay and an overhead for parsing XML responses into Python data.
> Currently there's no cache on the ZEO client side for these results,
> so you have to transport all data over and over again.

Yes, and that's always going to be the case with out-of-process solutions.

> My conclusion for the moment is that using Solr as a transparent
> replacement for the entire catalog is not possible. I think what we
> need to do first is to change Plone code and its use of the catalog.
> I had some naive hope of being able to avoid this and getting a magic
> silver bullet ;)
>
> 1. There's a number of catalog searches that can always live with a
> short delay

... but not necessarily a slow-down caused by network lag etc.

>. Examples are some portlets ("latest five news", "recent
> events", "review queue") - where a potential delay of a minute will be
> acceptable. Other portlets like the navigation tree generally need to
> be updated immediately. I think we need to add markers to the code,
> which signals this to the underlying layers. Say add "async=True"

(poor name - it makes it look like the query results are coming back
asynchronously, which wouldn't be the case)

> as
> an argument to the catalog query as a simple first step. The
> underlying layer can then decide which search backend to use and for
> "async" searches rely on Solr.

Does that mean keeping all data in Solr *and* portal_catalog?

> 2. Add information about the "columns" that are used from a specific
> query to the catalog call. This is the equivalent of saying "select a,
> b from foo" vs. "select * from foo". Solr has a similar mechanism,
> which allows specifying the data one expects on the returned flares
> (brains). This can limit the amount of data that needs to be
> transported in significant ways. In one project the full-fledged
> metadata of each object is about 1kb of XML - which quickly grows out
> of hand: the multiple queries on each page view add up to 10-20kb of
> XML to transport and parse. If all you need is "id", "title" and
> "path" you shouldn't have to transport and parse the rest.

+1

> 3. Carefully review all places we do catalog queries and consider
> replacing them with non-catalog solutions. I know some people started
> working on this and I think we need to do it. With Plone 4, blob
> support and the new plone.folder package there's not many reasons
> anymore to call the catalog all the time. Filtering for security
> restrictions is the one thing we need to find a convenient and
> performant API for. But generally all operations that work "on the
> current or one specific context / folder" should be done without the
> catalog.

+1 (and yes, I love the irony)

> 4. Encourage any experiments with using a SQL database as a catalog
> backend. I have heard people talk about it, but never seen anything
> concrete. An SQL database can be integrated into the same transaction
> as the ZODB and deliver a strong guarantee on consistency. This could
> be used for the kinds of queries where this strong guarantee is
> required. The main trick for this is, that an SQL database can do
> queries on the server side. With ZEO we currently transport the entire
> catalog of potentially some hundreds of mb to each application
> instance, do a query on it and use a couple bytes at the end. This
> scales to a couple of ZEO clients, but we are running sites which have
> 10 or 20 ZEO clients. Doing all calculations client side is just not a
> good model there. The ZODB has no query language that can be used to
> do server side queries.

I've always been intrigued by this. I don't think I have the
wherewithal to actually implement it, but it'd be a really nice
project for someone to do. I suspect that with SQLAlchemy (and
z3c.saconfig) we could get a nice, performant, optimisable solution.

Most likely, you'd want this as a drop-in replacement, so that small
sites didn't need to configure an RDBMS (I assume sqlite is out of the
question as it's single-threaded/hard to do in ZEO?).

> 5. Push forward the __parent__ pointer move. This will help with those
> queries that need to do a brain.getObject() today or where adding
> metadata to the catalog would slow it down in general. In these cases
> the catalog (or external backend) could simply return the list of
> poids of the objects as a result. Loading a single object from the
> ZODB is extremely fast and harnesses the ZODB cache. Currently we need
> to traverse to the object from the root based on its path, so it gets
> a proper Acquisition context to work with. The __parent__ pointer
> changes avoid these and you get a fully functional object from a
> "connection.get(poid)" call without any more wrapping or traversing.

Interesting point. That should be feasible if we fix the export use
case. The code's already there in p.a.folder.

> 6. Consider splitting out a catalog used for navigation from the
> general catalog. The navigation tree and sitemap rely on indexing the
> path and doing queries on that. But they generally need little
> metadata to be indexed. If we could avoid indexing the path in the
> normal catalog, moving an object inside the site hierarchy wouldn't
> invalidate the larger catalog anymore. This relies on getting a good
> UID story to use as a unique key in the catalog. As long as it uses
> "getPhysicalPath" internally as its unique key, there's no real
> advantage.

+1

> That's it for now. If you have any comments, they are much
> appreciated. I hope we can shape some of this into more concrete
> technical proposals and work on these issues. Generally we need to do
> these things if we want to be able to efficiently serve sites in the
> category of multiple hundreds of thousands of documents. For medium
> and small sites this should bring improvements as well. But a lot of
> this doesn't apply as long as you run with just one ZEO client that
> has all the data anyway.

I'd love to see this move forward for Plone 4.x or 5. I know Roche
from Upfront has been working on some of this as patches to Plone 3,
so at least there's some work started.

Martin

Hanno Schlichting-4

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Laurence Rowe
On Sun, May 9, 2010 at 3:36 AM, Laurence Rowe <[hidden email]> wrote:
> Hanno Schlichting-4 wrote:
>> There's a number of CMS systems which have opted for Solr and I think
>> Plone should do this in the mid-term as well. Search is just too
>> complicated a problem to solve on our own. What we offer is search
>> technology from 10 years back - the world has moved on.
>
> This is a big dependency for small sites - you need to configure a JVM and
> an extra daemon process. I'd be reluctant to just remove the, albeit
> limited, inbuilt text search capabilities of Plone.

I know there's an overhead. But any site that uses a ZEO server, ZEO
client setup, Varnish and a web server like Apache or Nginx already
has quite a bit to deal with. Java has the advantage of being
cross-platform, so there's not much extra care needed for Windows.
Ever since we moved to buildout we have the means to make this whole
setup easy and ship it with good defaults. Java itself tends to be
widely available, and relying on OS system packages isn't much of a
problem.

What we have is unfortunately not bad enough to force everyone to
replace it. But it is hindering any kind of progress or innovation we
could build on top of a powerful search backend. Facets, term vectors,
"more like this", geospatial search, the whole "index binary content"
story (Word, PDF, ...) - we have half-baked or no solutions to these.
I think we don't do anyone a good service if we don't deliver in these
fields as a CMS.

As usual good engineering principles apply, so this should still be
modular and allow people to opt out and go for external search
solutions. And this is certainly all mid-term (two to three years or
longer), as Solr currently lacks a good multilingual content story
and doesn't have much to offer for Asian languages.

> Laurence Rowe wrote:
> I suspect that most of this in-transaction usage could be eliminated by
> simply changing page edits to return a redirect rather than render an entire
> page. This would significantly reduce the time spent inside the transaction.

Yes. I think this is worth exploring. Decoupling the "change
operation" from rendering the result of the change might make sense.
But this is orthogonal to the other problems. Things like Solr won't
give you any guarantee on returning the correct data even in the
subsequent request. Whenever you need consistent data, you'll have to
use something that can guarantee it at the expense of waiting.

> I think the likelihood of a conflict error should be proportional to the
> transaction time multiplied by the number of (writing) threads.

Actually, most of the time there shouldn't be any conflicts unless
people are changing the same page. From looking at the enfold.fixes
work and my recent experience, I think there's just some indexes
using the wrong data types in the catalog. I think some DateIndexes
use IISets for some of the inner nodes, instead of IITreeSets.
Especially on date indexes we need to make sure these have a sensible
resolution. Alan Runyan suggested one second instead of the current
one minute, which makes changes to the same buckets much less likely.
Think of the modified index and the effective date being set on some
workflow actions. He reported sites sustaining high numbers of
concurrent writes after applying a number of small fixes.
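
Illustrative only - Plone's DateIndex has its own internal conversion,
so this merely shows the resolution idea, not a drop-in patch:

  from DateTime import DateTime

  def to_index_key(value, resolution=1):
      """Convert a DateTime to an integer key at `resolution` seconds."""
      return int(value.timeTime() // resolution)

  # Writers one second apart now land in different index buckets,
  # instead of colliding inside the same one-minute bucket:
  key_a = to_index_key(DateTime('2010/05/09 12:00:00'))
  key_b = to_index_key(DateTime('2010/05/09 12:00:01'))
  assert key_a != key_b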

> Laurence Rowe wrote:
> Do you have an idea of how many catalogue queries are required to render an
> average page?

I don't have any exact numbers, but generally there are many portlets
on the pages, so a dozen or more queries should be common. Most of
them return only five items, or up to 20 for the content listing page.

> Laurence Rowe wrote:
> Those portlets are prime candidates for replacing with ESI / AJAX includes.

That's another step to take on top of it. But there's no reason not to
make the query itself more efficient. The catalog has no notion of a
result limit or any kind of efficient sorting implementation. In order
to get the latest five news items, the standard catalog query will
sort the entire result set on the "Date" index for each query. With
large amounts of content this sorting alone contributes a significant
amount of time. Avoiding it even for the "no cache hit" case is
worthwhile.
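
For reference, the limited query shape being asked for would look
roughly like this (`catalog` being portal_catalog); ZCatalog's
"sort_limit" is only a hint (hence the extra slice), and how much
sorting it actually saves depends on the index:

  brains = catalog(portal_type='News Item',
                   sort_on='Date',
                   sort_order='reverse',
                   sort_limit=5)[:5]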

> If groups and roles could be encoded in the authentication cookie – now
> possible with the mod_auth_tkt format in Plone 4 – they could be generated
> as a simple transform of the Solr XML without any need for tying up a Zope
> thread.

I don't understand enough of our security machinery to judge what is
possible. I know that the allowedRolesAndUsers index is a major
bottleneck on sites with large numbers of authenticated users. Simply
transferring all the data from the user-specific btrees inside
the allowedRolesAndUsers index takes a lot of time and thrashes the
ZODB cache. But combinations of ownership, local owner roles, local
roles, group roles, nested groups, ... make my head spin :) As long as
we give user-specific permissions on each new document a user creates,
there's little we can do. I think we'd need to take away some of that
special owner handling before we can move on with this - otherwise we
still have user-specific permissions and checks all over. Most
organizations don't need those, but still pay the price now.

> Laurence Rowe wrote:
> Querying a SQL database will incur the same trade off as querying Solr.

Not quite. SQL databases are built and optimized for frequent updates
and queries. Solr currently needs to update data on the filesystem,
generate a new "searcher", warm it up from the old one to populate
caches, and more. Some of the planned improvements will allow it
to do some of this from memory instead of disk. But the entire design
is not built for this.

But only a real experiment could show what happens here. I'm merely
suggesting someone should try this out.

> Replicating the catalogue (whether SQL or Solr) onto each client machine
> would eliminate the network latency, and would only require one copy of the
> catalogue per machine, rather than per Zope thread (you normally run 4
> single threaded Zope processes per box).

In the setup where I tested this, I had a network latency of under one
millisecond on a gigabit line. The network latency is usually not the
real factor; it's the time it takes to look up and generate the data
in the external database and then decode it on the client side.
Effective use of specifying the "columns" is key here. Just to give a
number: a single ZEO client produced something like 20 Mbit/second of
incoming Solr traffic - as the site has a lot of additional metadata.

> Laurence Rowe wrote:
> When you can access an object cheaply there is no need to keep so much
> metadata about it - perhaps not any at all, other than path and permissions
> - the sort of stuff you might store in a zc.relationship type catalogue.

Indeed. I think we could end up in a situation at some point where we
don't need metadata in general. We have an awesome key/value store; if
we were able to use it as such, it would perform quite well.

Hanno

Hanno Schlichting-4

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Martin Aspeli
On Sun, May 9, 2010 at 5:30 AM, Martin Aspeli <[hidden email]> wrote:
> On 9 May 2010 06:12, Hanno Schlichting <[hidden email]> wrote:
>> For any non-trivial site I'd strongly suggest using
>> experimental.catalogqueryplan. We couldn't run our large sites without
>> it, and it makes the catalog a much less urgent bottleneck.
>
> If that's true, I think we need to get this out of experimental.* and
> into Zope/CMF/Plone.

It can go in, once we define new and clear catalog and indexing
contracts. The amount of trouble and monkey-patching all these
products have to do today is just not funny anymore. There's too much
flexibility and too little good abstraction to make this solid - not
the kind of code we want in the core. I think collective.indexing can
grow into plone.indexing at some point (and replace any
CMFCatalogAware, Archetypes CatalogMultiplex stuff). queryplan should
grow into a plone.catalog at some point, probably growing a better API
for features like the "show_inactive" flag, LinguaPlone's "Language"
filter and the like.

> I don't think this is quite so much of an issue for the full-text
> index scenario. It's an issue because we also use the catalog for
> navigation and other UI elements. If your page doesn't show up in the
> navtree when you've added it, you'll get upset. If it takes a minute
> for it to appear in a full text search from the "search" box, not so
> much.

Oh sure. That's the entire reason why using Solr for full text
searches works at all. You don't have any strong requirement on
immediate updates.

>> Even with all Solr queries taking 2ms on average,
>> rendering Plone pages as a whole has seen an increase in time of 100%
>> and more.
>
> You mean "rendering time doubled (and requests-per-second halved) with Solr"?

Yes. In this case the 90th percentile went from 2 seconds to 4 seconds,
and the median from around 700ms to 1.5 seconds. This was during a time
when no actual content updates took place, so the Solr index was
static for a number of hours. With actual content editing taking place
and the Solr caches being thrashed, I'd expect this to be much worse.

>> Say add "async=True" as
>> an argument to the catalog query as a simple first step. The
>> underlying layer can then decide which search backend to use and for
>> "async" searches rely on Solr.
>
> Does that mean keeping all data in Solr *and* portal_catalog?

Yes. You can throw out the SearchableText from the portal_catalog,
but otherwise I'd keep all indexes in both places in general. You can
optimize that in specific sites/applications, but I don't see a good
model for splitting things in general.

> I'd love to see this move forward for Plone 4.x or 5. I know Roche
> from Upfront has been working on some of this as patches to Plone 3,
> so at least there's some work started.

Indeed. This is all just to encourage people to start doing something
about it. We need these kinds of changes and we should be willing to
accept changes in this direction in the Plone 4.x releases - one PLIP
at a time.

Hanno

Martin Aspeli

Re: Thoughts on Solr as a catalog replacement and the catalog in general

Hi Hanno,

On 9 May 2010 23:35, Hanno Schlichting <[hidden email]> wrote:

> On Sun, May 9, 2010 at 5:30 AM, Martin Aspeli <[hidden email]> wrote:
>> On 9 May 2010 06:12, Hanno Schlichting <[hidden email]> wrote:
>>> For any non-trivial site I'd strongly suggest using
>>> experimental.catalogqueryplan. We couldn't run our large sites without
>>> it, and it makes the catalog a much less urgent bottleneck.
>>
>> If that's true, I think we need to get this out of experimental.* and
>> into Zope/CMF/Plone.
>
> It can go in, once we define new and clear catalog and indexing
> contracts. The amount of trouble and monkey-patching all these
> products have to do today is just not funny anymore.

Mmm... if the received advice is "you should install this" then even
those "ugly" patches are better than the status quo. Or, you need to
change the advice to "you should install this if ...". ;-)

> There's too much
> flexibility and too little good abstraction to make this solid - not
> the kind of code we want in the core. I think collective.indexing can
> grow into plone.indexing at some point (and replace any
> CMFCatalogAware, Archetypes CatalogMultiplex stuff). queryplan should
> grow into a plone.catalog at some point, probably growing a better API
> for features like the "show_inactive" flag, LinguaPlone's "Language"
> filter and the like.

I'm all for this, of course, though I suspect someone at Jarn will
need to drive it. ;)


>> You mean "rendering time doubled (and requests-per-second halved) with Solr"?
>
> Yes. In this case the 90th percentile went from 2 seconds to 4 seconds,
> and the median from around 700ms to 1.5 seconds. This was during a time
> when no actual content updates took place, so the Solr index was
> static for a number of hours. With actual content editing taking place
> and the Solr caches being thrashed, I'd expect this to be much worse.

So in other words, Solr slows things down. ;-)

>>> Say add "async=True" as
>>> an argument to the catalog query as a simple first step. The
>>> underlying layer can then decide which search backend to use and for
>>> "async" searches rely on Solr.
>>
>> Does that mean keeping all data in Solr *and* portal_catalog?
>
> Yes. You can throw out the SearchableText from the portal_catalog,
> but otherwise I'd keep all indexes in both places in general. You can
> optimize that in specific sites/applications, but I don't see a good
> model for splitting things in general.

What about having *just* SearchableText in Solr and the rest in
portal_catalog? This way, Solr is used for the main search form, but
nothing else (which is surely the use that makes sense).

>> I'd love to see this move forward for Plone 4.x or 5. I know Roche
>> from Upfront has been working on some of this as patches to Plone 3,
>> so at least there's some work started.
>
> Indeed. This is all just to encourage people to start doing something
> about it. We need these kinds of changes and we should be willing to
> accept changes in this direction in the Plone 4.x releases - one PLIP
> at a time.

Well, +1 from me anyway. :)

Martin

Alexander Limi

Re: Thoughts on Solr as a catalog replacement and the catalog in general

On Sun, May 9, 2010 at 10:44 AM, Martin Aspeli <[hidden email]> wrote:
> On Sun, May 9, 2010 at 5:30 AM, Martin Aspeli <[hidden email]> wrote:
>> On 9 May 2010 06:12, Hanno Schlichting <[hidden email]> wrote:
>>> For any non-trivial site I'd strongly suggest using
>>> experimental.catalogqueryplan. We couldn't run our large sites without
>>> it, and it makes the catalog a much less urgent bottleneck.
>>
>> If that's true, I think we need to get this out of experimental.* and
>> into Zope/CMF/Plone.
>
> It can go in, once we define new and clear catalog and indexing
> contracts. The amount of trouble and monkey-patching all these
> products have to do today is just not funny anymore.

> Mmm... if the received advice is "you should install this" then even
> those "ugly" patches are better than the status quo. Or, you need to
> change the advice to "you should install this if ...". ;-)

Is this something that would be suitable as a 4.x PLIP, or is it too intrusive?

--
Alexander Limi · http://limi.net


Rok Garbas

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Martin Aspeli
hi all,

2010/5/9 Martin Aspeli <[hidden email]>:

> Hi Hanno,
>
> On 9 May 2010 23:35, Hanno Schlichting <[hidden email]> wrote:
>> On Sun, May 9, 2010 at 5:30 AM, Martin Aspeli <[hidden email]> wrote:
>>> On 9 May 2010 06:12, Hanno Schlichting <[hidden email]> wrote:
>>>> For any non-trivial site I'd strongly suggest using
>>>> experimental.catalogqueryplan. We couldn't run our large sites without
>>>> it, and it makes the catalog a much less urgent bottleneck.
>>>
>>> If that's true, I think we need to get this out of experimental.* and
>>> into Zope/CMF/Plone.
>>
>> It can go in, once we define new and clear catalog and indexing
>> contracts. The amount of trouble and monkey-patching all these
>> products have to do today is just not funny anymore.
>
> Mmm... if the received advice is "you should install this" then even
> those "ugly" patches are better than the status quo. Or, you need to
> change the advice to "you should install this if ...". ;-)
>

Maybe something like Haystack (http://haystacksearch.org) could be
used here. It's a common API to several search engines (Solr, Whoosh,
Xapian). I'm not sure if it is possible to do this in an optimized
manner, but I certainly like the idea of being able to choose between
search engines.




--
Rok Garbas, Python.Zope.Plone consulting
web: http://garbas.si
phone(si): +386 70 707 300
phone(es): +34 68 941 79 62
email: [hidden email]

Guido Stevens

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Martin Aspeli
On 05/09/2010 05:44 PM, Martin Aspeli wrote:
> What about having *just* SearchableText in Solr and the rest in
> portal_catalog? This way, Solr is used for the main search form, but
> nothing else (which is surely the use that makes sense).

Wouldn't you lose significant Solr features that way? Like faceted
search. Or, maybe more obscure, the ability to do federated (cross-site)
search on more than just SearchableText?

:*CU#
--
***   Guido A.J. Stevens        ***   tel: +31.43.3618933    ***
***   [hidden email]   ***   Postbus 619            ***
***   http://www.cosent.nl      ***   6200 AP  Maastricht    ***

             s h a r i n g    m a k e s    s e n s e

    Nullege: A Search Engine for Python source code
    http://shar.es/m6cZ1

    http://twitter.com/GuidoStevens

yuri-2

Re: Thoughts on Solr as a catalog replacement and the catalog in general

On 10/05/2010 10:23, Guido A.J. Stevens wrote:
> On 05/09/2010 05:44 PM, Martin Aspeli wrote:
>    
>> What about having *just* SearchableText in Solr and the rest in
>> portal_catalog? This way, Solr is used for the main search form, but
>> nothing else (which is surely the use that makes sense).
>>      
> Wouldn't you lose significant Solr features that way? Like faceted
> search.
>    
where's the faceted search? :-o


Roel Bruggink

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Guido Stevens
On Mon, May 10, 2010 at 10:23 AM, Guido A.J. Stevens <[hidden email]> wrote:
> On 05/09/2010 05:44 PM, Martin Aspeli wrote:
>> What about having *just* SearchableText in Solr and the rest in
>> portal_catalog? This way, Solr is used for the main search form, but
>> nothing else (which is surely the use that makes sense).
>
> Wouldn't you lose significant Solr features that way? Like faceted
> search. Or, maybe more obscure, the ability to do federated (cross-site)
> search on more than just SearchableText?


We're looking for a decent ZCTextIndex replacement, not a way to use Solr. So that should not be an issue.
Isn't TextIndexNG3 an option?

--
Roel Bruggink
http://www.fourdigits.nl/mensen/roel-bruggink

Four Digits BV
http://www.fourdigits.nl
Willemsplein 44, 6811 KD, Arnhem
tel: +31(0)26 4422700 fax: +31(0)84 2206117
KVK 091621370000 BTW 8161.22.234.B01

Guido Stevens

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by yuri-2
On 05/10/2010 11:04 AM, Yuri wrote:
> where's the faceted search? :-o

Do you mean: how do I enable faceted search in a Solr-integrated Plone site?

/@@solr-controlpanel

Add 'Subject' as a 'Default search facet' and activate the Subject
checkbox in 'Filter query parameters'.

This will pop up a portlet-like facet navigation on the Subject axis in
your search results.

:*CU#
--
***   Guido A.J. Stevens        ***   tel: +31.43.3618933    ***
***   [hidden email]   ***   Postbus 619            ***
***   http://www.cosent.nl      ***   6200 AP  Maastricht    ***

             s h a r i n g    m a k e s    s e n s e

    Nullege: A Search Engine for Python source code
    http://shar.es/m6cZ1

    http://twitter.com/GuidoStevens

Paul Everitt-3

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Hanno Schlichting-4
On 5/8/10 6:12 PM, Hanno Schlichting wrote:
> 4. Encourage any experiments with using a SQL database as a catalog
> backend.

FWIW, we are experimenting with this on the KARL project.  Shane
Hathaway conceived and wrote, and Chris Rossi completed, a catalog
plugin for text indexing:

   http://svn.repoze.org/repoze.pgtextindex/trunk/docs/index.rst

With this, you wouldn't have to replicate the entire catalog into both
sides.  You leave everything the way it is in Plone's catalog, you just
no longer do text indexing.  That instead is done in Postgresql, which
doesn't need any of the other catalog indices.
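
A rough sketch of the underlying PostgreSQL mechanism - not
repoze.pgtextindex's actual schema or API; the table and column names
here are invented. Documents are indexed into a tsvector column and
queried with a ranked full-text match, inside the same transaction as
everything else:

  import psycopg2

  conn = psycopg2.connect('dbname=karl')
  cur = conn.cursor()
  cur.execute("""
      SELECT docid, ts_rank_cd(text_vector, query) AS rank
        FROM pgtextindex, to_tsquery('english', %s) AS query
       WHERE text_vector @@ query
       ORDER BY rank DESC
       LIMIT 20
  """, ('plone & catalog',))
  hits = cur.fetchall()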

Everything about the current catalog is shared between the "light" OOTB
Plone and the larger-scale Plone, except the swapped-out text indexer.
Thus, it is a much smaller change than SOLR, and possibly easier for
long-term maintenance of Plone code.  You maintain transactional
integrity, and I suppose it could be argued that asking someone to run
an RDBMS is a bit less demanding than running the stack used by SOLR.

This approach looks very promising.  The search quality and flexibility
are way up, resource consumption is *WAY* down, and search performance
is way up.  We haven't put this into production though.

Just food for thought.

--Paul


Laurence Rowe

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by yuri-2
Yuri-11 wrote
On 10/05/2010 10:23, Guido A.J. Stevens wrote:
> On 05/09/2010 05:44 PM, Martin Aspeli wrote:
>    
>> What about having *just* SearchableText in Solr and the rest in
>> portal_catalog? This way, Solr is used for the main search form, but
>> nothing else (which is surely the use that makes sense).
>>      
> Wouldn't you lose significant Solr features that way? Like faceted
> search.
>    
where's the faceted search? :-o
collective.solr should support faceted search out of the box, IIRC.

Laurence
Alan Runyan-3

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Hanno Schlichting-4
Hi guys.  I want to throw my 2c into this discussion.  We have been
doing oodles of optimizations for customers and I feel we have
some experience that may be useful.

Like always I have all sorts of crap to say.  I'm trying to keep it simple.

> Note that small sites generally won't run into this. Sites with up to
> a few tens of thousands of documents usually work fine. This really
> applies to sites with 100,000+ documents (objects in the catalog). For
> any non-trivial site I'd strongly suggest using
> experimental.catalogqueryplan. We couldn't run our large sites without
> it, and it makes the catalog a much less urgent bottleneck.

We have used this in the past.  IIRC it was fairly opaque without
a significant investment.  Would be nice to have some reporting feature
to know how well it's working.  Maybe this is in there now.

Kudos to this technology.  It would be great to see the broader
Zope community using this: repoze, grok, etc.

> With collective.solr it is pretty easy to use Solr as a backend for
> full text search queries. Replacing "SearchableText" queries is almost
> always a win, as the ZCTextIndex implementation is embarrassingly
> simple and ineffective at the same time. Additional features like
> facets are a nice bonus. There's small issues being worked on all the
> time and things like efficient batching being implemented, but in
> general it works pretty well.

There are several other options:

  - Shane's work using postgresql
  - Andrea's work using zopyx.textindexng3
  - Witsch's collective.solr

> My conclusion for the moment is that using Solr as a transparent
> replacement for the entire catalog is not possible. I think what we
> need to do first is to change Plone code and its use of the catalog.
> I had some naive hope of being able to avoid this and getting a magic
> silver bullet ;)

The portal_catalog is too dumb to be able to represent the richness of
a third party search engine.  My feeling is the best strategy is the
following (see the sketch after the list):

  - To implement the external search engine utility in pure python.
    - Easy to test
    - Reusable
    - Full features
    - Should not be limiting

  - Write a ZCatalog facade which integrates with the search engine utility
    - Backwards compatibility
    - Put your compromises in here
    - Plone integration tests here
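Roughly, and purely as a sketch (every name below is made up):

# Purely a sketch of the two layers described above; all names invented.

class SearchEngine(object):
    """Pure-Python utility wrapping the external engine (Solr or other).

    No ZCatalog concepts leak in here, so it stays easy to test and
    reusable outside Plone.
    """

    def __init__(self, backend):
        self.backend = backend

    def search(self, text, filters=None, fields=None, rows=20):
        return self.backend.query(text, filters or {}, fields, rows)


class CatalogFacade(object):
    """ZCatalog-ish facade: the backwards-compatibility layer.

    The compromises live here - catalog-style keyword queries get
    translated into calls on the engine utility.
    """

    def __init__(self, engine):
        self.engine = engine

    def searchResults(self, **query):
        text = query.pop('SearchableText', None)
        return self.engine.search(text, filters=query)

    __call__ = searchResults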

> 1. There's a number of catalog searches that can always live with a
> short delay. Examples are some portlets ("latest five news", "recent
> events", "review queue") - where a potential delay of a minute will be
> acceptable. Other portlets like the navigation tree generally need to
> be updated immediately. I think we need to add markers to the code,
> which signals this to the underlying layers. Say add "async=True" as
> an argument to the catalog query as a simple first step. The
> underlying layer can then decide which search backend to use and for
> "async" searches rely on Solr.

-1 on "leaking" expectations of the async nature into client code.
Maybe we need more catalogs.  You talk about this below,
e.g. navigation catalog, fulltext catalog, content catalog, etc.

FWIW, We have integrated zc.async into plone.
https://svn.enfoldsystems.com/public/plone.async.core/
https://svn.enfoldsystems.com/public/plone.async.indexing/

What should happen is that people should be able to write their own
indexing strategy easily.  An example (see the sketch after this list):

  - Update navigation_catalog immediately (say 3-4 indices)
  - Create an async job to update the rest of the catalogs out-of-process
  - Or possibly update Solr (NOTE: Solr would not need to conform
    to the ZCatalog, since there is a function that simply takes the info from
    the indexing operation and feeds it into Solr.)
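Hand-waving the details, the shape of such a strategy could be (the
queue here is a plain stand-in, not the real zc.async API):

# Hand-wavy sketch of a pluggable indexing strategy.  The queue is a
# plain stand-in for zc.async or any out-of-process job queue.
import Queue

async_queue = Queue.Queue()

def extract_index_data(obj):
    # A plain dict with no ZCatalog concepts - exactly what could be
    # fed straight into Solr.  Attribute names are illustrative.
    return {'uid': obj.uid, 'path': obj.path, 'title': obj.title}

def on_modified(obj, navigation_catalog):
    # The navigation catalog is tiny (say 3-4 indices), so update it
    # synchronously and navigation stays consistent in-transaction.
    navigation_catalog.index(obj.uid, obj.path)
    # Everything else - content catalog, Solr - becomes a deferred job.
    async_queue.put(extract_index_data(obj))

def indexing_worker(content_catalog, solr):
    # Runs out-of-process, draining jobs and feeding both backends.
    while True:
        data = async_queue.get()
        content_catalog.index(data)
        solr.add(data)  # Solr just takes the same plain dict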

> 2. Add information about the "columns" that are used for a specific
> query to the catalog call. This is the equivalent of saying "select a, b
> from foo" vs. "select * from foo". Solr has a similar mechanism, which
> allows you to specify the data you expect on the returned flares
> (brains). This can limit the amount of data that needs to be
> transported in significant ways. In one project the full-fledged
> metadata of an object is about 1kb of XML each - which quickly grows
> out of hand with multiple queries of 10-20kb for each page view and
> parsing that data from XML. If all you need is "id", "title" and
> "path", you shouldn't have to transport and parse the rest.

This would be more of a problem if you're trying to reduce the result
set coming back from Solr.  I'm unsure this is still a huge problem in
catalog terms if we have more narrowly defined catalogs.
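For reference, on the Solr side this is the "fl" parameter; a quick
illustration (the URL and field names are just examples):

# Illustration of field-limiting a Solr query with "fl", so only id,
# Title and path come back instead of the full metadata per flare.
# The URL and field names are examples, not a particular setup.
import urllib

params = urllib.urlencode({
    'q': 'SearchableText:plone',
    'fl': 'id,Title,path',   # only these fields in the response
    'rows': '20',
})
response = urllib.urlopen('http://localhost:8983/solr/select?' + params)
print response.read()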

> 3. Carefully review all places we do catalog queries and consider
> replacing them with non-catalog solutions. I know some people started
> working on this and I think we need to do it. With Plone 4, blob
> support and the new plone.folder package there are not many reasons
> anymore to call the catalog all the time. Filtering for security
> restrictions is the one thing we need to find a convenient and
> performant API for. But generally all operations that work "on the
> current or one specific context / folder" should be done without the
> catalog.

I agree we should use folders instead of catalogs for folder listings,
although we need to be mindful of what an item in a folder actually
loads across the wire.

Security is always hard.  Is there hard evidence that the security
infrastructure is slowing us down?  E.g. in the case of a
FolderContentsView which loads objects from the container, where all
security checks are done at the API level instead of through
RestrictedPython or sandboxing.
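For the simple cases, a catalog-free folder listing with an API-level
security check could look roughly like this (a sketch using the
standard AccessControl machinery):

# Sketch of a catalog-free folder listing: iterate the folder directly
# and filter with the security machinery.  Whether this is cheaper
# depends on what each child object pulls across the wire.
from AccessControl import getSecurityManager

def folder_listing(folder):
    checkPermission = getSecurityManager().checkPermission
    for obj in folder.objectValues():
        # The per-object, API-level security check mentioned above.
        if checkPermission('View', obj):
            yield obj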

> 5. Push forward the __parent__ pointer move. This will help with those
> queries that need to do a brain.getObject() today or where adding
> metadata to the catalog would slow it down in general. In these cases
> the catalog (or external backend) could simply return the list of
> poids of the objects as a result. Loading a single object from the
> ZODB is extremely fast and harnesses the ZODB cache. Currently we need
> to traverse to the object from the root based on its path, so it gets
> a proper Acquisition context to work with. The __parent__ pointer
> changes avoid these and you get a fully functional object from a
> "connection.get(poid)" call without any more wrapping or traversing.

Isn't this wonderful?  No more Acquisition.  8-)
Some analysts used Acquisition against Plone in their CMS evaluations.

I believe later in this thread elro points out that we *must* load the
parents so the context security can be fully computed.
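Schematically (p64 packs an integer oid into the 8-byte form that
connection.get() wants; the security walk is only an illustration of
elro's point):

# Schematic only.  With real __parent__ pointers an object fetched
# straight by oid is already fully usable - no path traversal from the
# root just to build an Acquisition chain.
from ZODB.utils import p64

def load(connection, oid):
    return connection.get(p64(oid))  # one direct, cache-friendly load

def local_roles_chain(obj):
    # elro's caveat: the parents still get loaded, because local roles
    # along the __parent__ chain feed the security computation.
    while obj is not None:
        yield getattr(obj, '__ac_local_roles__', {})
        obj = getattr(obj, '__parent__', None)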

> 6. Consider splitting out a catalog used for navigation from the
> general catalog. The navigation tree and sitemap rely on indexing the
> path and doing queries on that. But they generally need little
> metadata to be indexed. If we could avoid indexing the path in the
> normal catalog, moving an object inside the site hierarchy wouldn't
> invalidate the larger catalog anymore. This relies on getting a good
> UID story to use as a unique key in the catalog. As long as it uses
> "getPhysicalPath" internally as its unique key, there's no real
> advantage.

This is a necessity.  Also, we should RECOMMEND that people create
their own catalogs.  This is a much better proposition than people
adding indexes to the CMF/Plone catalogs.  It is more encapsulated
and makes migration simpler.
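Creating such an encapsulated catalog is cheap; a sketch with the plain
ZCatalog API (the index names are just examples):

# Sketch of an encapsulated add-on catalog, instead of piling indexes
# onto portal_catalog.  The index names are examples only.
from Products.ZCatalog.ZCatalog import ZCatalog

catalog = ZCatalog('my_addon_catalog')
catalog.addIndex('portal_type', 'FieldIndex')
catalog.addIndex('path', 'PathIndex')
catalog.addColumn('Title')  # keep the metadata deliberately small

# Indexing and querying then stay local to the add-on:
#   catalog.catalog_object(obj, uid='/plone/some/path')
#   results = catalog(portal_type='MyType')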

alan

Roché Compaan

Re: Thoughts on Solr as a catalog replacement and the catalog in general

On Tue, 2010-05-11 at 21:40 -0500, Alan Runyan wrote:

> Hi guys.  I want to throw my 2c into this discussion.  We have been
> doing oodles of optimizations for customers and I feel we have
> some experience that may be useful.
>
> As always, I have all sorts of crap to say.  I'm trying to keep it simple.
>
> > Note that for any small sites you generally won't run into this. Sites
> > up to a couple of 10.000 documents usually work fine. This really
> > applies to sites which have 100.000+ documents (objects in the
> > catalog). Note that for any non-trivial site I'd strongly suggest to
> > use experimental.catalogqueryplan. We couldn't run our large sites
> > without this and it makes the catalog a much less urgent bottleneck.
>
> We have used this in the past.  IIRC it was fairly opaque without
> a significant investment.  Would be nice to have some reporting feature
> to know how well it's working.  Maybe this is in there now.
>
> Kudos to this technology.  It would be great to see the broader
> Zope community using this: repoze, grok, etc.

In our deployments we rarely require such an advanced text index, and
in the cases where we did, we used textindexng3 or Google. I fear that
replacing the portal_catalog with another indexing solution will not
solve other fundamental problems in Plone.

First, one has to clearly decouple catalogs used for content searches
from catalogs used to support functionality like navigation, folder
listings, the review queue, etc. (see my blog post on this:
http://bit.ly/9068dz). Ideally there should be a single text index with
a handful of supporting indexes (effective, portal_type) in the catalog
used for content searches. If the catalog were this simple, it would
become much easier to safely replace the whole catalog with
alternatives without breaking Plone functionality, and it would make
the introduction of collective.solr a no-brainer.

> > With collective.solr it is pretty easy to use Solr as a backend for
> > full text search queries. Replacing "SearchableText" queries is almost
> > always a win, as the ZCTextIndex implementation is embarrassingly
> > simple and ineffective at the same time. Additional features like
> > facets are a nice bonus. There's small issues being worked on all the
> > time and things like efficient batching being implemented, but in
> > general it works pretty well.
>
> There are several other options:
>
>   - Shane's work using PostgreSQL
>   - Andrea's work using zopyx.textindexng3
>   - Witsch's collective.solr

textindexng3 has served us well. Hanno, what is your experience with it?

>
> > My conclusion for the moment is that using Solr as a transparent
> > replacement for the entire catalog is not possible. I think what we
> > need to do first is to change Plone code and its use of the catalog.
> > I had some naive hope of being able to avoid this and getting a magic
> > silver bullet ;)
>
> The portal_catalog is too dumb to be able to represent the richness of
> a third party search engine.  My feeling is the best strategy is:
>
>   - To implement the external search engine utility in pure python.
>     - Easy to test
>     - Reusable
>     - Full features
>     - Should not be limiting
>
>   - Write a ZCatalog facade which integrates with the search engine utility
>     - Backwards compatibility
>     - Put your compromises in here
>     - Plone integration tests here
>
> > 1. There's a number of catalog searches that can always live with a
> > short delay. Examples are some portlets ("latest five news", "recent
> > events", "review queue") - where a potential delay of a minute will be
> > acceptable. Other portlets like the navigation tree generally need to
> > be updated immediately. I think we need to add markers to the code,
> > which signals this to the underlying layers. Say add "async=True" as
> > an argument to the catalog query as a simple first step. The
> > underlying layer can then decide which search backend to use and for
> > "async" searches rely on Solr.
>
> -1 on "leaking" expectations of the async nature into client code.
> Maybe we need more catalogs.  You talk about this below,
> e.g. navigation catalog, fulltext catalog, content catalog, etc.

I strongly support the idea of separate catalogs, but am an even bigger
supporter of rethinking some of the catalog requirements entirely. You
don't need a catalog for navigation, folder contents, the review queue,
etc. The less there is to index, the better Plone will scale and
perform. Hanno, I know you suggested this as well, I just want to
vocally add my support here ;-) By aggressively ripping out indexes and
preventing certain content types from being indexed we have managed to
reduce Data.fs size by orders of magnitude, in one case down from 30GB
to less than 1GB.

>
> FWIW, We have integrated zc.async into plone.
> https://svn.enfoldsystems.com/public/plone.async.core/
> https://svn.enfoldsystems.com/public/plone.async.indexing/
>
> What should happen is that people should be able to write their own
> indexing strategy easily.  An example:
>
>   - Update navigation_catalog immediately (say 3-4 indices)
>   - Create an async job to update the rest of the catalogs out-of-process
>   - Or possibly update Solr (NOTE: Solr would not need to conform
>     to the ZCatalog, since there is a function that simply takes the info from
>     the indexing operation and feeds it into Solr.)
>
> > 2. Add information about the "columns" that are used for a specific
> > query to the catalog call. This is the equivalent of saying "select a, b
> > from foo" vs. "select * from foo". Solr has a similar mechanism, which
> > allows you to specify the data you expect on the returned flares
> > (brains). This can limit the amount of data that needs to be
> > transported in significant ways. In one project the full-fledged
> > metadata of an object is about 1kb of XML each - which quickly grows
> > out of hand with multiple queries of 10-20kb for each page view and
> > parsing that data from XML. If all you need is "id", "title" and
> > "path", you shouldn't have to transport and parse the rest.
>
> This would be more of a problem if you're trying to reduce the result
> set coming back from Solr.  I'm unsure this is still a huge problem in
> catalog terms if we have more narrowly defined catalogs.
>
> > 3. Carefully review all places we do catalog queries and consider
> > replacing them with non-catalog solutions. I know some people started
> > working on this and I think we need to do it. With Plone 4, blob
> > support and the new plone.folder package there are not many reasons
> > anymore to call the catalog all the time. Filtering for security
> > restrictions is the one thing we need to find a convenient and
> > performant API for. But generally all operations that work "on the
> > current or one specific context / folder" should be done without the
> > catalog.
>
> I agree we should use folders instead of catalogs for folder listings,
> although we need to be mindful of what an item in a folder actually
> loads across the wire.
>
> Security is always hard.  Is there hard evidence that the security
> infrastructure is slowing us down?  E.g. in the case of a
> FolderContentsView which loads objects from the container, where all
> security checks are done at the API level instead of through
> RestrictedPython or sandboxing.

We have developed an alternative folder contents implementation that
indexes allowedRolesAndUsers locally:
http://svn.plone.org/svn/collective/upfront.diet/trunk/upfront/diet/adapters.py

We have done a lot of work in our upfront.diet product to lessen
dependency on the catalog and are ready to split off some of the code
into smaller products that can be tested without swallowing the whole
pill.
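The core idea, reduced to a sketch with zope.index (this is not the
upfront.diet code itself; the attribute names are illustrative):

# Not the upfront.diet code - just the idea: index each child's
# allowedRolesAndUsers locally, then intersect with the current user's
# tokens to filter a listing without the big catalog.
from zope.index.keyword import KeywordIndex

index = KeywordIndex()

def index_child(docid, obj):
    # allowedRolesAndUsers as the usual indexer computes it, e.g.
    # ['Manager', 'Reader', 'user:bob']
    index.index_doc(docid, obj.allowedRolesAndUsers)

def visible_docids(user_tokens):
    # Any overlap between the object's allowed tokens and the user's
    # tokens (roles plus 'user:<id>') makes the object visible.
    return index.search(user_tokens, operator='or')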

>
> > 5. Push forward the __parent__ pointer move. This will help with those
> > queries that need to do a brain.getObject() today or where adding
> > metadata to the catalog would slow it down in general. In these cases
> > the catalog (or external backend) could simply return the list of
> > poids of the objects as a result. Loading a single object from the
> > ZODB is extremely fast and harnesses the ZODB cache. Currently we need
> > to traverse to the object from the root based on its path, so it gets
> > a proper Acquisition context to work with. The __parent__ pointer
> > changes avoid these and you get a fully functional object from a
> > "connection.get(poid)" call without any more wrapping or traversing.
>
> Isn't this wonderful?  No more Acquisition.  8-)
> Some analysts used Acquisition against Plone in their CMS evaluations.
>
> I believe later in this thread elro points out that we *must* load the
> parents so the context security can be fully computed.
>
> > 6. Consider splitting out a catalog used for navigation from the
> > general catalog. The navigation tree and sitemap rely on indexing the
> > path and doing queries on that. But they generally need little
> > metadata to be indexed. If we could avoid indexing the path in the
> > normal catalog, moving an object inside the site hierarchy wouldn't
> > invalidate the larger catalog anymore. This relies on getting a good
> > UID story to use as a unique key in the catalog. As long as it uses
> > "getPhysicalPath" internally as its unique key, there's no real
> > advantage.
>
> This is a necessity.  Also, we should RECOMMEND that people create
> their own catalogs.  This is a much better proposition than people
> adding indexes to the CMF/Plone catalogs.  It is more encapsulated
> and makes migration simpler.

I would also recommend that developers do not automatically index each
and every new content type they develop. Not all content types need to
be indexed for content searches. Given that Plone uses the catalog for
almost everything, it isn't really possible to prevent indexing of
certain content types without writing your own folder listing or
navigation portlet.


--
Roché Compaan
Upfront Systems                   http://www.upfrontsystems.co.za


Hedley Roos

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Alan Runyan-3
>> Note that for any non-trivial site I'd strongly suggest to
>> use experimental.catalogqueryplan. We couldn't run our large sites
>> without this and it makes the catalog a much less urgent bottleneck.
>
> We have used this in the past.  IIRC it was fairly opaque without
> a significant investment.  Would be nice to have some reporting feature
> to know how well it's working.  Maybe this is in there now.
>
Way back in time we had a site that could not run adequately without
catalogqueryplan. It is something I just blindly stick into buildout.cfg
these days.

>> 6. Consider splitting out a catalog used for navigation from the
>> general catalog. The navigation tree and sitemap rely on indexing the
>> path and doing queries on that. But they generally need little
>> metadata to be indexed. If we could avoid indexing the path in the
>> normal catalog, moving an object inside the site hierarchy wouldn't
>> invalidate the larger catalog anymore. This relies on getting a good
>> UID story to use as a unique key in the catalog. As long as it uses
>> "getPhysicalPath" internally as its unique key, there's no real
>> advantage.

I'd like this unique key to be unique across all ZCatalogs. I may want
to combine lazy results from multiple catalog search results. I can't
think of a practical example right now, but there has to be one, right? :)

>
> This is a necessity.  Also we should RECOMMEND people creating
> their own catalog.  This is a much better proposition than people
> adding indexes to CMF/Plone catalog's.  It is more encapsulated
> and makes migration simpler.
>

Yes. And it is not hard to make your own catalog these days.

Lennart Regebro-2

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Alan Runyan-3
From my experiments with Lucene in Plone and CPS 3-4 years ago, I
reached a couple of conclusions:

1. Hierarchical navigation should probably not be handled by the same
utility as search

The effect of using search for navigation is the ExtendedPathIndex,
which is complicated to use, has a weird API, and is almost impossible
to emulate if you want to put your data into SQL or
Lucene/Solr/whatever. I also did some tests, and storing hierarchical
navigation in a way that simply indexes the hierarchy, so you can ask
for "all objects in this object", can be made simple and super fast.
There should therefore, ideally, be a specific utility to support
hierarchical navigation. It should also index security, and I think
that's pretty much it.
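A toy sketch of such a hierarchy index - a plain parent-to-children
mapping (not the actual test code):

# Toy sketch of a hierarchy index: each container id maps straight to
# its children, so "all objects in this object" is a dictionary walk
# with none of the ExtendedPathIndex machinery.  A real version would
# also index security, as suggested above.
from collections import defaultdict

children = defaultdict(set)

def index(uid, parent_uid):
    children[parent_uid].add(uid)

def all_objects_in(uid):
    # Depth-first walk over the whole subtree.
    for child in children[uid]:
        yield child
        for descendant in all_objects_in(child):
            yield descendant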

2. We need a better search engine

This search engine may or may not be the catalog, but it should
support proper batching, sometimes called "incremental search". I.e.,
when you ask for the first 20 search results, you do *not* generate all
10,000 results and then drop the last 9,980. To implement this, each
index must be a generator, in Python terms, always returning one and
only one result: the next match. Implementing this may be a challenge.
I think a catalog query plan is also necessary there, to plan the order
in which the indexes are queried for highest efficiency. Also, the
search results object should keep the "index generators" alive, so that
when you ask for the next batch, you simply continue from where you
left off instead of redoing the query and dropping the first 20
results. :)

I don't know how easy it will be to make the current catalog and its
indexes support this, or if it's even possible.
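A toy illustration of the generator idea, assuming every index can
yield its matching docids in ascending order (which is not how the
current indexes work):

# Toy illustration only, assuming each index yields its matching docids
# in ascending order.  The intersection is then lazy: asking for 20
# results only pulls what is needed, and keeping the generator alive
# gives you the next batch without redoing the query.

def intersect(*iterables):
    # Leapfrog intersection over sorted, duplicate-free docid streams.
    iters = [iter(it) for it in iterables]
    current = [it.next() for it in iters]
    while True:
        high = max(current)
        for i, it in enumerate(iters):
            while current[i] < high:
                current[i] = it.next()  # StopIteration = no more matches
        if min(current) == high:
            yield high  # every index matched this docid
            current = [it.next() for it in iters]

# Usage: results = intersect(text_hits, state_hits)
#        batch = list(itertools.islice(results, 20))  # next batch continues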

Also, it should be easy to keep fields externally, like in an SQL
database, or to have the full text in Lucene. This may already be the
case; I've never tried.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64

Hedley Roos

Re: Thoughts on Solr as a catalog replacement and the catalog in general



On 12/05/2010 10:04, Lennart Regebro wrote:

> There should therefore, ideally, be a specific utility to support
> hierarchical navigation. It should also index security, and I think
> that's pretty much it.
>
This exists (as an adapter, not a utility) and it does exactly what you
describe. It uses zope.index to do the indexing and re-uses
allowedRolesAndUsers to index the security.

I override folder contents with overrides.zcml to use this adapter, but
I still have to get folder_listing to work with it.

We use it in production, but I have to split upfront.diet into smaller
packages before I can release an egg.

Hedley

Michael Hierweck

Re: Thoughts on Solr as a catalog replacement and the catalog in general

In reply to this post by Lennart Regebro-2
Some aspects just FYI:

Lennart Regebro wrote:

>
> 2. We need a better search engine
>
> This search engine may or may not be the catalog, but it should
> support proper batching, sometimes called "incremental search". I.e.,
> when you ask for the first 20 search results, you do *not* generate
> all 10,000 results and then drop the last 9,980. To implement this,
> each index must be a generator, in Python terms, always returning one
> and only one result: the next match. Implementing this may be a
> challenge. I think a catalog query plan is also necessary there, to
> plan the order in which the indexes are queried for highest
> efficiency. Also, the search results object should keep the "index
> generators" alive, so that when you ask for the next batch, you simply
> continue from where you left off instead of redoing the query and
> dropping the first 20 results. :)
>
> I don't know how easy it will be to make the current catalog and its
> indexes support this, or if it's even possible.

First:

A product supporting this kind of generator-based "incremental search"
already exists:

http://pypi.python.org/pypi/dm.incrementalsearch/2.0

In conjunction with "Advanced Query" it can be used transparently for
catalog queries:

http://pypi.python.org/pypi/Products.AdvancedQuery/3.0
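Usage looks roughly like this (from memory of the AdvancedQuery API, so
treat it as a sketch):

# From memory of the AdvancedQuery API - a sketch, not gospel.
# "catalog" is assumed to be the usual portal_catalog tool.
from Products.AdvancedQuery import And, Eq

query = And(Eq('portal_type', 'Document'),
            Eq('review_state', 'published'))
# With dm.incrementalsearch installed, evalAdvancedQuery combines the
# indexes incrementally instead of materializing full result sets.
results = catalog.evalAdvancedQuery(query, (('modified', 'desc'),))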

Second:

While trying several approaches to improve the performance of catalog
queries, I found that "catalog cache" superseded all other approaches.
However, it can't be considered stable.

http://pypi.python.org/pypi/collective.catalogcache/0.2

Personal conclusion:

Maybe it would be possible to combine the approaches of the query
plan, incremental search and the catalog cache.

Along with keeping the "main" portal catalog small (reduce the number
of indexes and metadata, let add-on products use their own catalogs)
and *optionally* making SearchableText replaceable by Solr, this might
improve both performance and flexibility.

Michael
