
Search Tools for Your Web Site

PDI 2008 Workshop Description

What is it?

Various search tools can be used to create a customized search for a Web site. Components may include:

  • user interface (e.g. search box, search results page)
  • server (hardware and software), crawler and indexer

Why use it?

  • Provide users with an additional way to navigate your site
    • Links and navigation menus are the traditional way to navigate the Web
    • Search is especially important for sites with many pages
  • Many other sites have a search box, so users expect one
    • Ideally every page on your site should have a site search box
    • Search box is most often found in the page header
  • Improve staff intranet productivity
  • Find errors in order to improve page content
    • Ensure that all important content is indexed
  • Create a custom search engine for a group or research topic
    • Can include content from any sites on the Internet

Popular Tools

CSU Libraries Demos

    Google Custom Search Engine (Co-op/CSE)

    • Interface is easy to customize using Libraries template
      • Results are Google-like, with Google Custom Search logo
      • Added code for menu to narrow search to one subdirectory
      • Can search content on multiple servers (lib and digital)
    • Keywords to narrow search
    • Sites/URLs to include or exclude, wildcards allowed
    • Editions: standard has ads; business/university/nonprofit editions do not
    • Add to Google home page, get code
    • Refinements to label categories in some sites
    • Look and feel of search box and results
    • Code to copy and paste in your search and results pages
    • Collaboration of contributors, invited or volunteers
    • Preview - try out your searches
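
The include/exclude lists with wildcards mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's actual matching rules: the host names, the patterns, and the `in_scope` helper are all hypothetical, and the shell-style `*` from Python's `fnmatch` module is an assumption about what the wildcard means.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration; not the workshop's real configuration.
INCLUDE = ["lib.example.edu/*", "digital.example.edu/collections/*"]
EXCLUDE = ["lib.example.edu/staff/*", "*.tmp"]


def in_scope(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if url matches an include pattern and no exclude pattern."""
    # Drop the scheme so patterns can be written without http:// or https://.
    bare = url.split("://", 1)[-1]
    # Exclusions win over inclusions in this sketch.
    if any(fnmatchcase(bare, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(bare, pattern) for pattern in include)
```

Checking exclusions first means a staff-only folder stays out of the results even when a broader include pattern covers the whole server.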

    Google Mini

    • Turnkey server (hardware and software) in our server room
    • Interface is fairly easy to customize using Libraries template
      • Menu of collections to narrow search
      • No Google branding needed
    • Crawl and Index
      • Crawl URLs - patterns to start, follow, or not crawl
      • Crawl Schedule - continuous or specific days/times
      • Crawler Access - internal, password-protected, proxy servers
      • Collections - groups of URL patterns to search together
    • Serving
      • Front ends - separate interfaces for public, staff, test
        • Output format, KeyMatch, related queries, remove URLs
    • Status and Reports
      • Crawl status - documents found/crawled/served
      • Crawl diagnostics - URLs crawled, excluded or with errors
      • Content statistics - documents by file type
      • Search reports - collections, dates, keywords, queries
    • Administration
      • User accounts - admin or manager, collections, frontends
      • Reset index - clear database and start over
      • Import/export configuration - backup all settings
      • System, network, SNMP, certificates, SSL, LDAP, license
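
KeyMatch, listed under front ends above, pairs highly-used keywords with staff-suggested URLs. A minimal sketch of that lookup follows, assuming a simple whole-word match. The table entries, the URLs, and the `keymatches` function are invented for illustration and do not reflect how the Google Mini stores or matches its KeyMatch entries.

```python
# Hypothetical KeyMatch table: keyword -> (title, URL). Entries are invented.
KEYMATCHES = {
    "hours": ("Library Hours", "http://lib.example.edu/hours.html"),
    "interlibrary loan": ("Interlibrary Loan", "http://lib.example.edu/ill/"),
}


def keymatches(query, table=KEYMATCHES):
    """Return promoted (title, url) pairs whose keyword appears in the query."""
    # Lowercase and collapse whitespace so matching ignores case and spacing.
    normalized = " ".join(query.lower().split())
    hits = []
    for keyword, link in table.items():
        # Pad with spaces so a keyword only matches whole words.
        if f" {keyword} " in f" {normalized} ":
            hits.append(link)
    return hits
```

A results page could show any returned links above the regular results.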

    Features

    User Interface

    • Public and staff/restricted interfaces (front ends)
    • Search and results pages can be customized?
      • page layout, header, footer, colors, styles, ads
    • Faceted search
      • left navigation links to subcategories or topics with fixed # of items
      • e.g. dates, countries, languages, subjects
    • Collections (limit search to specific folders or sets of URLs you define)
    • KeyMatch (staff-suggested URLs for highly-used keywords)
    • Spellchecker ("did you mean...") and suggestions for related terms
    • Advanced search
      • Keywords/phrases (and, or, not, exact phrase, part of word)
      • Limit (to a collection, language, format, domain, or field)
      • Sort (by relevance, date, title, etc.)
      • Output format (# results per page, long/short/URL, group by site)
    • Duplicates/similar items are removed or grouped?
    • XML search results available (for flexible formatting by scripts/XSLT)?
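
When a tool offers XML search results, a short script can reformat them for your own results page. The sketch below uses Python's standard `xml.etree.ElementTree`. The XML layout (`results`, `result`, `url`, `title`, `snippet`) is a made-up example, since each search tool defines its own element names, and `results_to_html` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical XML; real tools define their own element names.
SAMPLE = """<results query="maps">
  <result>
    <url>http://lib.example.edu/maps/</url>
    <title>Map Collection</title>
    <snippet>Historic &amp; topographic maps.</snippet>
  </result>
  <result>
    <url>http://digital.example.edu/maps/atlas</url>
    <title>Digital Atlas</title>
    <snippet>Scanned atlases.</snippet>
  </result>
</results>"""


def results_to_html(xml_text):
    """Turn the XML result list into an HTML ordered list."""
    root = ET.fromstring(xml_text)
    items = []
    for result in root.findall("result"):
        url = result.findtext("url", default="")
        title = result.findtext("title", default=url)
        snippet = result.findtext("snippet", default="")
        # Escape every value before placing it in HTML.
        items.append(
            f'<li><a href="{escape(url)}">{escape(title)}</a><br>{escape(snippet)}</li>'
        )
    return "<ol>\n" + "\n".join(items) + "\n</ol>"
```

The same XML could instead be passed through an XSLT stylesheet; a script is shown here only because it is easy to adapt.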

    Crawl and Index

    • Crawl/search multiple domains or hosts
    • URLs to crawl
    • Filters (remove domains or URLs from crawls, indexes or interfaces)
    • File formats indexed (HTML, PDF, Word, Excel, etc.)
    • Crawl frequency (increase/decrease overall or for certain pages/patterns)
    • Usage reports (top queries, top keywords)
    • Crawl reports (URLs crawled/excluded, errors)
    • Helps create files for crawlers? (robots.txt, sitemap.xml)
    • Access to password-protected pages or proxy servers
    • Meta tag information used or ignored?
    • Language and character set support?
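
Before relying on a robots.txt file to steer a crawler, it can help to test which URLs it actually blocks. The sketch below uses Python's standard `urllib.robotparser`. The rules, the URLs, the `example-crawler` agent name, and the `allowed` helper are assumptions for illustration, and a given search tool's crawler may interpret robots.txt somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staff/
Disallow: /cgi-bin/
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="example-crawler"):
    """Return True if the given robots.txt rules let the agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Running a list of important pages through a check like this is one way to confirm that none of them is accidentally excluded from the index.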

    Other Selection Criteria

    • Provider: Commercial? Cost? Licensing? Open source?
    • Limits: # domains, pages, queries; ads, vendor branding
    • Platform: Windows or Unix? Apache or IIS? Programming language?
    • Performance: Searches must be fast or users will go elsewhere
    • Administration: Multiple administrators? Roles?
    • Ease of configuration: GUI-based and/or file-based?
    • Support: phone/email, user community, documentation, training, upgrades, longevity

    Other Resources