Push URL into WebConnector

We would like to push a URL into WebConnector when it's created or updated within our CMS. Is there a way to push a URL into WebConnector rather than having it poll for what to index? I'm trying to avoid pushing directly into CFS; reproducing the parsing and all the other logic within WebConnector is not something I want to take on.

  • 0  

    Please use the Fetch action of the Web Connector.

    https://www.microfocus.com/documentation/idol/IDOL_12_0/WebConnector/Help/index.html#Actions/Fetch/_CN_Synchronize.htm?TocPath=Reference|Actions|Fetch|_____3

  • 0 in reply to   

    Thanks. Looking at that option, it appears that it will kick off a sync for a fetch task. There is an option to specify an identifier, but unless that identifier can be a URL, it does not do what I need: I'm looking to insert new content without recrawling the whole site or even a section.

    Can the call look something like

    host:port/action=Fetch&FetchAction=Synchronize&Identifiers=<new_url>

    Or do I need to create a dummy Feeder fetch task and use

    host:port/action=Fetch&FetchAction=Synchronize&TaskSection=Feeder&[Feeder]Url0=<new url>

     

  • 0   in reply to 

    You will need to build the Config parameter to supply the new URL along with other settings, such as MaxPages=1, so that only that page is crawled.

    The Config parameter expects the task section from WebConnector.cfg as a base64-encoded string.

    The configuration below indexes only one URL. If the fetch task is defined as follows:

    [BPay]
    //The url(s) from which to start the crawl.
    Url=https://en.wikipedia.org/wiki/BPAY
    IndexDatabase=Finacial_Industry
    //Regexes to restrict pages that are crawled for links. If a page is not crawled, it will not be indexed.
    SpiderUrlMustHaveRegex=
    SpiderUrlCantHaveRegex=.*\.css$|.*.css\?.*$|.*\.js$|.*\.js,.*$|.*.js\?.*|.*\/ScriptResource\.axd.*|.*\/WebResource\.axd.*

    //Regexes to restrict pages that are indexed.
    UrlMustHaveRegex=.*\BAY
    UrlCantHaveRegex=
    PageMustHaveRegex=
    PageCantHaveRegex=
    ContentTypeMustHaveRegex=
    ContentTypeCantHaveRegex=(application|text)/(javascript|xml|x-javascript|css)(;.*)?

    //The delay between processing pages, per sync thread
    PageDelay=5s

    //If StayOnSite=true, all links that are followed on a page must be on the same in order site to be crawled
    StayOnSite=true

    //Maximum ammount of time to spend crawling, 0 indicating unlimited
    SiteDuration=0s

    //Maximum number of pages to ingest per synchronize run, 0 indicating unlimited
    MaxPages=1
     
    then its base64 encoding is:
     
    W0JQYXldCi8vVGhlIHVybChzKSBmcm9tIHdoaWNoIHRvIHN0YXJ0IHRoZSBjcmF3bC4KVXJsPWh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0JQQVkKSW5kZXhEYXRhYmFzZT1GaW5hY2lhbF9JbmR1c3RyeQovL1JlZ2V4ZXMgdG8gcmVzdHJpY3QgcGFnZXMgdGhhdCBhcmUgY3Jhd2xlZCBmb3IgbGlua3MuIElmIGEgcGFnZSBpcyBub3QgY3Jhd2xlZCwgaXQgd2lsbCBub3QgYmUgaW5kZXhlZC4KU3BpZGVyVXJsTXVzdEhhdmVSZWdleD0KU3BpZGVyVXJsQ2FudEhhdmVSZWdleD0uKlwuY3NzJHwuKi5jc3NcPy4qJHwuKlwuanMkfC4qXC5qcywuKiR8LiouanNcPy4qfC4qXC9TY3JpcHRSZXNvdXJjZVwuYXhkLip8LipcL1dlYlJlc291cmNlXC5heGQuKgoKLy9SZWdleGVzIHRvIHJlc3RyaWN0IHBhZ2VzIHRoYXQgYXJlIGluZGV4ZWQuClVybE11c3RIYXZlUmVnZXg9LipcQkFZClVybENhbnRIYXZlUmVnZXg9ClBhZ2VNdXN0SGF2ZVJlZ2V4PQpQYWdlQ2FudEhhdmVSZWdleD0KQ29udGVudFR5cGVNdXN0SGF2ZVJlZ2V4PQpDb250ZW50VHlwZUNhbnRIYXZlUmVnZXg9KGFwcGxpY2F0aW9ufHRleHQpLyhqYXZhc2NyaXB0fHhtbHx4LWphdmFzY3JpcHR8Y3NzKSg7LiopPwoKLy9UaGUgZGVsYXkgYmV0d2VlbiBwcm9jZXNzaW5nIHBhZ2VzLCBwZXIgc3luYyB0aHJlYWQKUGFnZURlbGF5PTVzCgovL0lmIFN0YXlPblNpdGU9dHJ1ZSwgYWxsIGxpbmtzIHRoYXQgYXJlIGZvbGxvd2VkIG9uIGEgcGFnZSBtdXN0IGJlIG9uIHRoZSBzYW1lIGluIG9yZGVyIHNpdGUgdG8gYmUgY3Jhd2xlZApTdGF5T25TaXRlPXRydWUKCi8vTWF4aW11bSBhbW1vdW50IG9mIHRpbWUgdG8gc3BlbmQgY3Jhd2xpbmcsIDAgaW5kaWNhdGluZyB1bmxpbWl0ZWQKU2l0ZUR1cmF0aW9uPTBzCgovL01heGltdW0gbnVtYmVyIG9mIHBhZ2VzIHRvIGluZ2VzdCBwZXIgc3luY2hyb25pemUgcnVuLCAwIGluZGljYXRpbmcgdW5saW1pdGVkCk1heFBhZ2VzPTEK
     
    Using this approach, you can specify all the config parameters your job needs.
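
As a minimal sketch of the encoding step described above, the following Python base64-encodes a shortened task section and can decode a pasted value for checking. The `Config` parameter name comes from this reply; the rest of the query string and the `host:port` placeholder are assumptions modelled on the calls earlier in the thread, so verify them against the Fetch action documentation before relying on them.

```python
import base64
from urllib.parse import quote

# Shortened version of the [BPay] task above; the values are illustrative.
TASK_SECTION = (
    "[BPay]\n"
    "Url=https://en.wikipedia.org/wiki/BPAY\n"
    "IndexDatabase=Finacial_Industry\n"
    "MaxPages=1\n"
)


def encode_task_section(section_text: str) -> str:
    """Return the task section as a base64 string (UTF-8 bytes assumed)."""
    return base64.b64encode(section_text.encode("utf-8")).decode("ascii")


def decode_task_section(encoded: str) -> str:
    """Reverse of encode_task_section, handy for checking a pasted value."""
    return base64.b64decode(encoded).decode("utf-8")


def build_fetch_url(host_port: str, encoded_config: str) -> str:
    """Assemble a Synchronize call; the query layout is an assumption."""
    # base64 output can contain '+', '/' and '=', so percent-encode it.
    return (
        f"http://{host_port}/action=Fetch&FetchAction=Synchronize"
        f"&Config={quote(encoded_config, safe='')}"
    )


if __name__ == "__main__":
    encoded = encode_task_section(TASK_SECTION)
    print(build_fetch_url("host:port", encoded))
```

Decoding the long string above with `decode_task_section` is a quick way to confirm exactly which settings a given Config value carries.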
  • Verified Answer

    0 in reply to   

    Vinay,

    Thank you for your response. The first answer got me where I needed to be. The final solution that we will be implementing is as follows.

    To add a URL to the IDOL database, issue the following command:

     

     

    webConnectorServer:7006/.../resourceNum2&t=2

     

    Note that the URLs supplied need to be URL-encoded.
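
For illustration, URL-encoding a page address before placing it in the command might look like the sketch below. The path and query string are hypothetical, and whether WebConnector needs every reserved character encoded is an assumption; encoding all of them is simply the conservative choice.

```python
from urllib.parse import quote, unquote

# Hypothetical page URL from the CMS; the path is not taken from the thread.
page_url = "https://our.website.com/news/item?id=1"

# safe="" also encodes "/" and ":", so the whole URL can sit inside
# a single query-string value without being split on "&" or "?".
encoded = quote(page_url, safe="")
print(encoded)

# Decoding restores the original URL.
assert unquote(encoded) == page_url
```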

    The fetch task is defined as follows:

     

    [immediate]
    url0=https://our.website.com/
    IngestEnableDeletes=true
    StayOnSite=true
    IndexDatabase=db1
    Depth=0
    SpiderUrlMustHaveRegex=(.*/our\.website\.com/.*)
    SpiderUrlCantHaveRegex=(.*\.css$)|(.*.css\?.*$)|(.*\.js$)|(.*\.js,.*$)|(.*.js\?.*)|(.*.zip.*)|(.*\?utm.*)|(.*&utm.*)
    UrlCantHaveRegex=(.*\.css$)|(.*.css\?.*$)|(.*\.js$)|(.*\.js,.*$)|(.*.js\?.*)|(.*.zip.*)|(.*\?utm.*)|(.*&utm.*)
    UserAgent=IDOL12-immediate
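
To get a feel for what the exclusion pattern in this task rejects, it can be tried against a few sample URLs. This is only a rough check: it uses Python's `re` module and assumes the pattern must match the whole URL, whereas WebConnector's regex engine and matching rules may differ. The sample URLs are hypothetical.

```python
import re

# Copied from the [immediate] task above.
CANT_HAVE = (
    r"(.*\.css$)|(.*.css\?.*$)|(.*\.js$)|(.*\.js,.*$)|(.*.js\?.*)"
    r"|(.*.zip.*)|(.*\?utm.*)|(.*&utm.*)"
)


def is_excluded(url: str) -> bool:
    """True if the whole URL matches the pattern (assumed semantics)."""
    return re.fullmatch(CANT_HAVE, url) is not None


# Hypothetical URLs for illustration only.
samples = [
    "https://our.website.com/site.css",
    "https://our.website.com/app.js?v=3",
    "https://our.website.com/page?utm_source=mail",
    "https://our.website.com/news/item",
    "https://our.website.com/unzipping-guide",
]
for url in samples:
    print(url, "excluded" if is_excluded(url) else "allowed")
```

Under these assumptions the last sample is excluded too, because the dot before `zip` is unescaped and matches any character; escaping it as `\.zip` would limit the rule to file extensions.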

     

    You only need to use base64 encoding if you are supplying a whole fetch task.

    To find the port number, look in the [server] section of the webconnector.cfg file. You may need to add a line for Access-Control-Allow-Origin=<cms-server-name> (or * if you don't mind who can affect your WebConnector, which is not recommended for production).
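
If the port needs to be looked up from a script, a small parser along these lines could read it from the config text. This is a sketch only: it assumes the file uses `[Section]` headers, `key=value` lines, and `//` comments as in the examples in this thread, and the sample excerpt (including the `[Service]` entry) is hypothetical.

```python
from typing import Optional


def find_server_port(cfg_text: str) -> Optional[str]:
    """Return the Port value from the [Server] section, or None if absent."""
    in_server = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blanks and //-style comments
        if line.startswith("[") and line.endswith("]"):
            # Section names are compared case-insensitively.
            in_server = line[1:-1].strip().lower() == "server"
            continue
        if in_server and "=" in line:
            key, value = line.split("=", 1)
            if key.strip().lower() == "port":
                return value.strip()
    return None


# Hypothetical excerpt; a real webconnector.cfg has many more settings.
SAMPLE_CFG = """
[Service]
ServicePort=7008

[Server]
// ACI port used for fetch actions
Port=7006
"""

print(find_server_port(SAMPLE_CFG))
```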