Funnelweb question

Dylan Jay-4

Funnelweb question


On 07/03/2013, at 9:12 AM, [hidden email] wrote:

>
> Hi
>
> My brother works at the University Library, and they are moving to WordPress for their web pages.
>
> They are moving their current content MANUALLY (thousands of pages), and to show them how insane this is (and maybe the choice of WordPress, too), I want to show him how to migrate with Funnelweb.
>
> This is a typical page:
> http://www.ub.uib.no/avdeling/spes/magasin/spesialsamlingssider/librar.htm
>
> ... but I cannot find any useful div/class to extract as "body text"; it's all tables (and until now, Funnelweb has worked "out of the box" for me).
>
> Maybe it is possible to somehow remove some of the <tr> and <td>s. It looks like the body text is usually in the second <td> of the second <tr>.

"Usually" is the key word. You have to find the pattern, or group of patterns, that holds across the pages.

You can use an XPath like //tr[2]/td[2], which means "the second <td> of the second <tr>".
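
To make the positional path concrete, here is a minimal sketch using Python's standard-library ElementTree as a stand-in (an assumption for illustration, not what Funnelweb runs internally). The PAGE markup is hypothetical and deliberately well-formed; real table-layout pages usually need a forgiving HTML parser such as lxml.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical table-layout page, kept well-formed so ElementTree can parse it.
PAGE = """
<html><body>
  <table>
    <tr><td>logo</td><td>banner</td></tr>
    <tr><td>menu</td><td><p>Body text lives here.</p></td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# ".//tr[2]/td[2]": any <tr> that is the second <tr> child of its parent,
# then the second <td> child of that row.
cells = root.findall(".//tr[2]/td[2]")

# Flatten the matched cell to plain text.
body_text = "".join(cells[0].itertext()).strip()
print(len(cells), body_text)
```

Note that the index counts siblings of the same tag under one parent, so a page with nested layout tables can produce more than one match.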

Sometimes I've used width values if they are unique enough.

You can also do some more advanced XPath by looking inside elements.
For example, if it's the td with the h2 in it, you can use //td[.//h2].
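
A rough sketch of both ideas (selecting by a width value and selecting the cell that contains a heading), again with standard-library ElementTree and a made-up PAGE snippet. ElementTree only implements a subset of XPath and cannot nest ".//" inside a predicate, so the //td[.//h2] test is spelled out in Python here; a full XPath engine such as lxml accepts the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout row: a narrow menu cell and a wide content cell.
PAGE = """
<table>
  <tr><td width="150">menu</td>
      <td width="600"><div><h2>Title</h2></div><p>Body</p></td></tr>
</table>
""".strip()

root = ET.fromstring(PAGE)

# Attribute test: pick the cell by a width value only the content cell uses.
by_width = root.findall(".//td[@width='600']")

# Same meaning as //td[.//h2]: keep each <td> with an <h2> anywhere below it.
with_h2 = [td for td in root.iter("td") if td.find(".//h2") is not None]

# Both approaches land on the same cell in this snippet.
print(by_width[0] is with_h2[0])
```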

Remember you can also use multiple template sections: if one of the compulsory elements doesn't match, it will go on to the next template.
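
The fall-through behaviour described above can be pictured roughly like this. It is only an illustration of the idea in plain Python with ElementTree, not Funnelweb's actual code; TEMPLATES, extract and the (path, required) layout are hypothetical names invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical templates, tried in order. Each maps a field name to
# (path, required); the layout is illustrative only.
TEMPLATES = [
    {"title": (".//h2", True), "text": (".//tr[2]/td[2]", True)},
    {"title": (".//title", True), "text": (".//body", True)},
]

def extract(root, templates=TEMPLATES):
    """Return the fields of the first template whose required paths all match."""
    for template in templates:
        fields = {}
        for name, (path, required) in template.items():
            node = root.find(path)
            if node is None:
                if required:
                    break  # compulsory field missing: give up on this template
                continue   # optional field missing: just skip it
            fields[name] = "".join(node.itertext()).strip()
        else:
            return fields  # no break happened, so every required field matched
    return None

# A page with no tables and no <h2>: the first template fails, the second wins.
page = ET.fromstring(
    "<html><head><title>Library</title></head>"
    "<body><p>No tables on this page.</p></body></html>"
)
result = extract(page)
print(result)
```

Ordering the templates from most specific to most general keeps the catch-all from shadowing the better matches.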

You can also enable transmogrify.htmlcontentextractor.auto [1] which will try to find the xpath for you.


[1] https://pypi.python.org/pypi/transmogrify.htmlcontentextractor/1.0


>
> Espen


_______________________________________________
Plone-Users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/plone-users
espen

Re: Funnelweb question

Hi and thanks for the answer

On 7 March 2013, at 00:57, "Dylan Jay-4 [via Plone]" <[hidden email]> wrote:


> "Usually" is the key word. You have to find the pattern, or group of patterns, that holds across the pages.
>
> You can use an XPath like //tr[2]/td[2], which means "the second <td> of the second <tr>".
>
> Sometimes I've used width values if they are unique enough.
>
> You can also do some more advanced XPath by looking inside elements.
> For example, if it's the td with the h2 in it, you can use //td[.//h2].

This is very useful. I googled a little and also found [contains(@class,'someclass')].
Does this work the same way, and how "deep" does it go?

Let's say you have
<div><span><span><h2>title</h2></span></span></div>
and
<div><h2>title</h2></div>

Would a rule like //div[.//h2] work for both?
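
For reference, a small sketch of how these two predicates behave, written with Python's standard-library ElementTree. The helper names has_descendant and class_contains are made up for the illustration, and the predicates are spelled out in Python because ElementTree's XPath subset lacks contains() and nested ".//" predicates. In XPath, ".//h2" means "an h2 at any depth below this element", so both snippets above satisfy //div[.//h2].

```python
import xml.etree.ElementTree as ET

nested = ET.fromstring("<div><span><span><h2>title</h2></span></span></div>")
direct = ET.fromstring("<div><h2>title</h2></div>")

def has_descendant(elem, tag):
    # Same test as the XPath predicate [.//tag]: a match at any depth counts.
    return elem.find(".//" + tag) is not None

# Both the deeply nested and the direct-child case match.
print(has_descendant(nested, "h2"), has_descendant(direct, "h2"))

def class_contains(elem, fragment):
    # Mirrors [contains(@class, 'fragment')]: a plain substring test on the
    # attribute value, so "main" also matches class="mainmenu".
    return fragment in elem.get("class", "")

cell = ET.fromstring('<td class="mainmenu left">menu</td>')
print(class_contains(cell, "main"))
```

Because contains() is a substring test, a short fragment can match more class names than intended; a longer or more specific fragment narrows it down.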


That said, I still cannot get it to work without using (CSS) classes / ids (importing another Plone site using ids works great). Could something be wrong with xml on my OS X (missing libs or something)?

I made this (test) pipeline, and I cannot understand why it adds "blank" pages (the title field is empty and the body text field is empty). All pages have a "title" and a "body", so something should show up(?)
________

[transmogrifier]
include = funnelweb.remote

[crawler]

[template1]
title       = text //title
description = optional //nothing
text        = html //body

[ploneupload]

__________

