Crawl and Scrape Websites

Power-Chat draws on knowledge stored in wiki articles. When that knowledge already exists on external websites, you can bring it into Unusual Suite by scraping individual pages or crawling an entire site.

Scrape a single webpage

Scraping imports the content of one external page into a wiki article. To scrape a page:

  1. Create a new wiki article.
  2. Fill in the External link field with the URL of the webpage.
  3. Click the 'Robot' icon to the right of the External link field.

Unusual Suite then scrapes the page content, removes any header, footer, menu, or navigational elements, and compiles a summary that appears when the article is found in wiki search results.
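The clean-up step can be pictured with a small sketch. It is not Unusual Suite's actual implementation, which is not documented here; it only illustrates the general idea of dropping header, footer, and navigation elements and keeping the remaining text. The names SKIPPED_TAGS, MainTextExtractor, and extract_main_text, as well as the exact list of skipped elements, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements treated as page chrome. This list is an assumption for illustration.
SKIPPED_TAGS = {"header", "footer", "nav", "menu", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with header, footer, and navigation removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """
<html><body>
<header>Site title</header>
<nav><a href="/">Home</a></nav>
<main><h1>Opening hours</h1><p>We are open on weekdays.</p></main>
<footer>Contact us</footer>
</body></html>
"""

print(extract_main_text(sample))
```

In this sample only the heading and the paragraph inside the main element survive. Real pages often mark their navigation with generic elements and class names instead, which is one reason to review the imported content afterwards.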

You can scrape a page again at any time to import an updated version of the content. Scraping works on any node in the wiki tree, and you can also scrape all sub-nodes of the currently selected node in one operation.

Warning: review the scraped content after importing. In rare cases it may not match the source webpage exactly.

Crawl a complete website

The 'Robot' icon in the wiki interface also lets you crawl an entire website. When you start a crawl:

  • All pages linked from the start URL that belong to the same domain are crawled and scraped.
  • A separate wiki article is created for each page found, placed under the currently selected node in the wiki navigation tree.
  • You can configure the wiki article type assigned to newly created articles, and whether those articles should be published immediately after creation.
  • Unusual Suite attempts to extract the correct subject for each new wiki article from the page content.
  • When a large number of articles are created, Unusual Suite groups them into sub-nodes by the first letter of the subject.
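Two of the behaviours listed above, the same-domain restriction and the grouping by first letter, can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's own code: the function names same_domain_links and group_by_first_letter are hypothetical, "same domain" is interpreted here as "same host name", and the "#" bucket for empty subjects is invented for the example.

```python
from collections import defaultdict
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(start_url, hrefs):
    """Resolve hrefs against start_url and keep those on the same host."""
    start_host = urlparse(start_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(start_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar links
        if parsed.netloc.lower() != start_host:
            continue  # skips links that lead to another site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def group_by_first_letter(subjects):
    """Group article subjects under the upper-cased first letter."""
    groups = defaultdict(list)
    for subject in subjects:
        cleaned = subject.strip()
        # "#" is a placeholder bucket for empty subjects (an assumption).
        key = cleaned[0].upper() if cleaned else "#"
        groups[key].append(cleaned)
    return {key: sorted(groups[key]) for key in sorted(groups)}


links = same_domain_links(
    "https://example.com/docs/",
    [
        "intro.html",
        "/pricing",
        "https://other.example.org/x",
        "mailto:info@example.com",
        "intro.html#top",
    ],
)
print(links)
print(group_by_first_letter(["Billing", "api keys", "Accounts"]))
```

In this example the external link and the mailto link are dropped, and the two references to intro.html collapse into one entry. Whether Unusual Suite treats subdomains such as www.example.com and docs.example.com as the same domain is not stated in this article, so check the resulting sub-tree if the site you crawl spans several host names.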

Warning: monitor the crawling process as it runs. Crawling results are not always deterministic, so you must review the structure and content of the resulting sub-tree after every crawl.