General Tips: What to Do, What Not to Do, and Important Notes
Do or Recommended
1) For a SharePoint sites content source, if you want to crawl the content of a particular site collection on a different schedule than other site collections, select "Crawl only the SharePoint site of each start address". This option accepts any URL, but the crawl starts from the top-level site of the site collection that contains the URL you enter. For example, if you enter http://contoso/sites/sales/car and http://contoso/sites/sales is the top-level site of the site collection, the site collection http://contoso/sites/sales and all of its subsites are crawled.
2) For a SharePoint sites content source, if you want to crawl all content in all site collections of a particular Web application on the same schedule, select "Crawl everything under the host name of each start address". This option accepts only host names as start addresses, such as http://contoso. You cannot enter the URL of a subsite, such as http://contoso/sites/sales, when using this option.
3) For a Web sites content source, if the only relevant content is on the first page, select "Crawl only the first page of each start address".
4) For a Web sites content source, if you want to limit how deep the crawler follows links from the start addresses, select "Custom" and specify the number of pages deep and the number of server hops to crawl. We recommend that you start with a small number on a highly connected site, because specifying more than three pages deep or more than three server hops can end up crawling the entire Internet. You can also use one or more crawl rules to specify what content to crawl.
5) For a file shares or Exchange public folders content source, if the content in the subfolders is not likely to be relevant, select "Crawl only the folder of each start address".
6) For a file shares or Exchange public folders content source, if the content in the subfolders is likely to be relevant, select "Crawl the folder and subfolder of each start address".
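The "Custom" page-depth and server-hop limits for Web sites content sources can be pictured as a bounded breadth-first walk over links. The sketch below is only an illustration in Python, not how the SharePoint crawler is implemented: the function name crawl_limited, the in-memory link graph, and the rule that a server hop is any link whose host differs from the page that links to it are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_limited(start, links, max_depth, max_hops):
    """Return the URLs reachable from `start` within both limits.

    `links` is a hypothetical in-memory link graph: {url: [linked urls]}.
    Simplification: only the first path found to each URL is recorded.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0, 0)])  # (url, page depth, server hops)
    while queue:
        url, depth, hops = queue.popleft()
        for target in links.get(url, []):
            if target in seen:
                continue
            # Assumption: a link to a different host counts as one server hop.
            crossed = urlparse(target).netloc != urlparse(url).netloc
            new_depth = depth + 1
            new_hops = hops + (1 if crossed else 0)
            if new_depth > max_depth or new_hops > max_hops:
                continue
            seen.add(target)
            order.append(target)
            queue.append((target, new_depth, new_hops))
    return order


links = {
    "http://contoso/": ["http://contoso/a", "http://fabrikam/x"],
    "http://contoso/a": ["http://contoso/b"],
    "http://fabrikam/x": ["http://adatum/y"],
}
first_page_only = crawl_limited("http://contoso/", links, max_depth=0, max_hops=0)
two_deep_one_hop = crawl_limited("http://contoso/", links, max_depth=2, max_hops=1)
```

With a depth of 0 only the start address is returned, which corresponds to "Crawl only the first page of each start address". Raising either limit widens the set quickly on a highly connected site, which is why a small starting value is recommended.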
Do Not
1) You cannot crawl the same address using multiple content sources. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites on a different schedule. For performance reasons, the same start address cannot be added to multiple content sources.
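Before creating content sources, it can help to check a planned list of start addresses for overlaps. The following Python sketch is a planning aid only and does not call any SharePoint API. The function name find_conflicts and the dictionary layout are hypothetical, and it assumes that a source crawls everything under its start address, so any path-prefix relationship on the same host is reported as a conflict (this may be stricter than SharePoint itself).

```python
from urllib.parse import urlparse


def _normalize(url):
    """Lower-case the URL and split it into (host, path without trailing slash)."""
    parts = urlparse(url.lower())
    return parts.netloc, parts.path.rstrip("/")


def find_conflicts(content_sources):
    """Return (source_a, address_a, source_b, address_b) for overlapping addresses.

    `content_sources` is a hypothetical plan: {source name: [start addresses]}.
    """
    entries = [
        (name, address, _normalize(address))
        for name, addresses in content_sources.items()
        for address in addresses
    ]
    conflicts = []
    for i, (name_a, addr_a, (host_a, path_a)) in enumerate(entries):
        for name_b, addr_b, (host_b, path_b) in entries[i + 1:]:
            if name_a == name_b or host_a != host_b:
                continue
            same = path_a == path_b
            nested = path_b.startswith(path_a + "/") or path_a.startswith(path_b + "/")
            if same or nested:
                conflicts.append((name_a, addr_a, name_b, addr_b))
    return conflicts


plan = {
    "Sales": ["http://contoso/sites/sales"],
    "Cars": ["http://contoso/sites/sales/car"],
    "HR": ["http://contoso/sites/hr"],
}
conflicts = find_conflicts(plan)
```

In this plan, "Cars" is flagged because its start address is a subsite of the address already covered by "Sales", which is the situation described above.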
Important Notes
1) Content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message. Content resides in a content repository, such as a Web site, file share, or SharePoint site.
2) A content source is a set of rules that tells the crawler where it can find content, how to access the content, and how to behave when it is crawling the content. It includes one or more addresses of a content repository from which to start crawling, also called start addresses. These settings apply to all start addresses within the entire content source.
3) Types of content repository: Web sites, SharePoint sites, Exchange public folders, network folders (file shares), and business data (BDC).
4) Each content source contains a list of start addresses that the crawler uses to connect to the repository of content
5) When the crawler accesses the start addresses listed in a content source, the crawler must be authenticated by and granted access to the servers that host that content. The user account that is used by the crawler must have at least read permission to crawl content
6) The crawler uses protocol handlers to access content and then IFilters to extract content from the files that are crawled. IFilters remove application-specific formatting before the engine indexes the content of a document. Office SharePoint Server 2007 crawls only file types for which a protocol handler and an IFilter are installed.
7) The crawler uses protocol handlers and IFilters as follows:
· The crawler retrieves the start addresses of content sources and calls the protocol handler based on the URL’s prefix.
· The protocol handler connects to the content source and extracts system-level metadata and access control list (ACL) information.
· The protocol handler identifies the file type of each content item, based on the file name extension, and calls the appropriate IFilter associated with that file type.
· The IFilter extracts content, removing any embedded formatting, and then retrieves content item metadata.
· Content is parsed by one or more language-appropriate word breakers and is added to the content index, also called the full-text index. Metadata and access control lists are added to the search database.
8) If there is no IFilter for a file type that you want to crawl, the content index in SharePoint can include only the file's properties, not the file's content. If you want to index content for a file type that does not have an IFilter installed by default (for example, PDF), you have to install and register an IFilter for that file type.
9) Content is only crawled if the relevant file name extension is included in the file-type inclusions list and an IFilter is installed on the index server that supports those file types
10) The crawler uses protocol handlers to access content. When creating a content source, shared services administrators specify the protocol handler that the crawler will use when crawling the URLs specified in that content source
11) Default protocol handlers include, for example: http, https, bdc, sps, sps3, bdc2, sts, rb.
12) If you want to crawl content that does not have a protocol handler installed, you must install a third-party or custom protocol handler before you can crawl that content. Several third-party protocol handlers are available.
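The way protocol handlers, the file-type inclusions list, and IFilters interact in the notes above can be summarized as a small decision function. This Python sketch is a simplified model, not SharePoint code: the three sets are illustrative values (not the product defaults), plan_item is a hypothetical name, and the sketch only handles URLs that end in a file name with an extension.

```python
import os
from urllib.parse import urlparse

# Illustrative values only; the real lists are configured on the index server.
PROTOCOL_HANDLERS = {"http", "https", "sps3", "bdc"}
FILE_TYPE_INCLUSIONS = {"doc", "docx", "txt", "html", "pdf"}
INSTALLED_IFILTERS = {"doc", "docx", "txt", "html"}  # assumes no PDF IFilter yet


def plan_item(url):
    """Describe what would happen to one content item in this simplified model."""
    parsed = urlparse(url)
    # The protocol handler is chosen from the URL prefix.
    if parsed.scheme.lower() not in PROTOCOL_HANDLERS:
        return "skip: no protocol handler"
    # The file type is taken from the file name extension.
    extension = os.path.splitext(parsed.path)[1].lstrip(".").lower()
    if extension not in FILE_TYPE_INCLUSIONS:
        return "skip: file type not included"
    # Without an IFilter only the properties can be indexed.
    if extension not in INSTALLED_IFILTERS:
        return "index properties only"
    return "index content and properties"


docx_result = plan_item("http://contoso/docs/report.docx")
pdf_result = plan_item("http://contoso/docs/manual.pdf")
```

In this model the PDF falls into the "properties only" case until a PDF IFilter is installed and registered, which matches note 8.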
1) Database Server: The index server writes metadata that it collects from crawled documents into tables on the database server. When Indexer Performance is set to Maximum, the index server can generate data at a rate that overloads the database server. This can affect the performance of other applications that are using the same database server. It can also affect the performance of other shared services that are running under the shared services provider (SSP), such as Excel Calculation Services.
2) Index server: Indexing can place considerable demands on index server resources such as the disk, processors, and memory. An index server must have sufficient hardware to accommodate the amount of indexing required by your organization
3) Web Front End Server: To crawl content on local SharePoint sites, the index server sends requests to Web front-end servers that host the content. Such requests consume resources on the Web front-end servers and can thus reduce the responsiveness of the SharePoint sites that are hosted on these servers for end users.
4) Monitoring server performance during crawls can help you determine the appropriate setting for Indexer Performance. We recommend that you conduct your own testing to balance crawl speed, network latency, database load, and the load on crawled servers.
5) Consider the following suggestions regarding adjusting the Indexer Performance setting:
· If you are using the index server and database server only for searching (using the Office SharePoint Server Search service), you might want to set the Indexer Performance level to Maximum and note how this affects your database server performance. If the increase in database server CPU utilization exceeds 30 percent, we recommend changing the Indexer Performance level to Partly reduced.
· If the index server and database server are shared across multiple services, such as the Office SharePoint Server Search service and Excel Calculation Services, we recommend that you select the Partly reduced or Reduced setting for Indexer Performance.
6) Manage crawler impact: Content crawls can place a significant load on crawled servers and thereby adversely affect response times for the users of those servers. Therefore, we recommend that you use crawler impact rules to specify how aggressively your crawler should perform. A search services administrator can manage the effect of the crawler on a crawled site by using a crawler impact rule to specify one of the following:
· The maximum number of documents that the crawler can request at a time from the specified site.
· The frequency with which the crawler can request any particular document from the specified site.
7) Try to avoid crawling internal servers at peak load times
8) You can increase or limit the quantity of content that is crawled by using:
· Crawl settings in the content sources: For example, you can specify to crawl only the start addresses that are specified in a particular content source, or you can specify how many levels deep in the namespace (from those start addresses) to crawl and how many server hops to allow. Note that the options that are available within a content source for specifying the quantity of content that is crawled vary by content-source type.
· File type inclusions: You can choose the file types that you want to crawl.
· Crawl rules: You can use crawl rules to exclude all items in a given path from being crawled. This is a good way to ensure that subsites that you do not want to index are not crawled with a parent site that you are crawling. You can also use crawl rules to increase the amount of content that is crawled, for example by crawling complex URLs for a given path.
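To reason about how a set of include and exclude crawl rules would treat a given URL, a small model can be useful before configuring the rules. The Python sketch below is an approximation for planning, not the SharePoint rule engine: should_crawl is a hypothetical helper, the rules are assumed to be evaluated in order with the first match winning, and paths use simple "*" wildcards.

```python
from fnmatch import fnmatchcase


def should_crawl(url, rules, default=True):
    """Return True if `url` would be crawled under this simplified rule model.

    `rules` is an ordered list of (path pattern, include) pairs.
    Assumption: the first matching rule decides; unmatched URLs use `default`.
    """
    for pattern, include in rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return include
    return default


rules = [
    ("http://contoso/sites/sales/archive/*", False),  # exclude an unwanted subsite
    ("http://contoso/*", True),                       # include the rest of the host
]
archived = should_crawl("http://contoso/sites/sales/archive/2006.aspx", rules)
current = should_crawl("http://contoso/sites/sales/default.aspx", rules)
```

Placing the exclusion rule for the subsite ahead of the broader inclusion rule is what keeps the archive out of the crawl of its parent site in this model.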