Online purchasing has grown tremendously during the past several years and continues to expand at a rapid pace. The ability to choose from thousands of products literally at one’s fingertips has eased the life of the shopper. However, the myriad choices now available have made it increasingly difficult for the customer to reach the product in which he truly is interested. The potential buying value of web-based purchasing degrades as the customer settles for a product that fails to meet his exact requirements.

Perhaps the most significant challenge faced by the information technology community today is to supply high-quality, accurate information to the searcher. A capable search engine is of primary importance in compiling a list of products based on the customer’s query and eliminating those that fail to meet the search criteria. The customer is then able to complete his purchase from a select group of appropriate choices, thus saving time and money. In support of this enterprise, Tricom Document Management has extended its horizon by building and operating a system for scraping and structuring valuable information from different websites. The key to this activity is the web crawler.

What is web crawling?

Web crawling is the process of harvesting listings from websites for use by search engines; the program that performs it is called a web crawler. The crawler begins at a website’s home page and traverses the entire site, scraping content and metadata in order to generate a feed. This feed can be indexed by the search engine for fast and easy retrieval.
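As an illustration of the feed step, the sketch below turns a few scraped listings into a small XML feed. It is a minimal sketch, not Tricom’s actual feed format: the function name build_feed, the element names feed and listing, and the sample fields are all assumptions chosen for the example, and Python is used only for illustration.

```python
import xml.etree.ElementTree as ET


def build_feed(listings):
    """Serialize a list of listing dictionaries as one XML feed string.

    The element names ("feed", "listing") are hypothetical; a real feed
    would follow whatever schema the search engine expects.
    """
    root = ET.Element("feed")
    for listing in listings:
        item = ET.SubElement(root, "listing")
        # Each dictionary key becomes a child element holding its value.
        for field, value in listing.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Sample data invented for the example.
sample = [{"title": "Red dress", "price": "49.99",
           "url": "http://example.test/p/1"}]
print(build_feed(sample))
```

Because ElementTree escapes special characters in the text values, the resulting feed stays well-formed even when a listing title contains characters such as & or <.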

Execution Model

The model of execution includes:

  1. Controller Module - This module provides the graphical user interface (GUI) for the web crawler and controls the crawler’s operations. Through the GUI the user enters the start URL, sets the maximum number of URLs to crawl, and views the URLs as they are fetched. The Controller drives the Fetcher and the Parser.
  2. Fetcher Module - This module begins by fetching the page at the start URL specified by the user. It then retrieves all the links on each page and keeps following them until the maximum number of URLs is reached.
  3. Parser Module - This module parses the pages fetched by the Fetcher module and saves their contents to disk.
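The three modules above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Tricom’s implementation: the names crawl, fetch, extract_links, LinkParser and fetch_page are hypothetical, the GUI is replaced by plain function arguments, pages are kept in memory rather than saved to disk, and Python is used only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Fetcher: download one page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(base_url, html):
    """Parser: return absolute, fragment-free URLs for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, max_urls, fetch_page=fetch):
    """Controller: breadth-first crawl from start_url, up to max_urls pages.

    Returns a dictionary mapping each fetched URL to its HTML.
    """
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_urls:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that cannot be fetched
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://example.test/", 50) would fetch at most 50 pages starting from that address. The fetch_page argument exists so that a different fetcher (for example one that reads from a local cache) can be swapped in without touching the crawl loop.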

Domain of Current Crawlers

Tricom crawlers keep search results current by crawling more than a million listings every day. The domains covered include:

  • Job sites
  • Vacation rentals
  • Retailers
  • Real estate
  • Merchandise
  • Vehicles
  • Pets
  • Tickets and events
  • Rental housing
  • Personals
  • And many more

Features and Services

  • Independence - Unlike many organizations that use one common application for all websites, Tricom develops customized crawlers, each of which extracts information only from the site for which it was developed. A site-specific crawler can better handle the extraction requirements of its site and thus can deliver extremely accurate results.
  • Stability - The crawlers are robust and are designed to keep working through many changes to the site.
  • Scalability - The existing crawling scripts can be extended to new sections of a site automatically or with minor modifications. The modularity of the code makes this scalability feasible.
  • Security - Tricom’s crawlers obey the Robots Exclusion Standard (robots.txt) and thus do not breach the security of the web server. The crawler adheres to the rule of not extracting information from restricted areas, thus maintaining privacy.
  • Speed - The core crawling mechanism has already been developed. Any modifications for a specific site in terms of fields and content can be accomplished quickly. The final deliverables are produced as per the client’s requirements.
  • Accuracy - The crawlers developed by Tricom generate feeds with a content accuracy of 100% in most cases and more than 98% in the rest. Since the crawlers are site-specific, the error rate and the rate of parse failures are negligible.
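The robots exclusion check mentioned under Security can be sketched with Python’s standard urllib.robotparser module. This is a minimal sketch, not Tricom’s code: the helper names build_robot_rules and allowed, the user agent string ExampleCrawler, and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(rules, url, user_agent="ExampleCrawler"):
    """Return True if the rules permit user_agent to fetch url."""
    return rules.can_fetch(user_agent, url)


# Sample robots.txt content invented for the example.
rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
print(allowed(rules, "http://example.test/listings/1"))
print(allowed(rules, "http://example.test/private/data"))
```

A crawler would run a check like this before each fetch and skip any URL for which it returns False. In practice the rules would be downloaded from the site’s own /robots.txt rather than written inline.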

Once a crawler has been developed, changes to the site for which it was designed may require modifications. How often modifications are needed depends on the extent and frequency of the changes to the site. In each case Tricom will modify the script to accommodate the changes.

Tricom’s solution allows you to:

  • crawl any site irrespective of the size of the store
  • extract data only from targeted sites to save in your database
  • generate feeds in less than a day
  • save time and money
  • keep search listings up to date

Among the many e-commerce sites Tricom has crawled are:

  • www.Shopbop.com
  • www.Vacationrentals.com
  • www.A1vacation.com
  • www.Vast.com
  • www.Nordstrom.com
  • www.Bluefly.com

The Cost-effective Solution

The crawlers developed by Tricom are extremely cost-effective. Fees are based on either the effort required or the number of sites to be scraped. The charges for our services are among the lowest in the industry.

Areas of Specialization

  • Web Development
  • Web Services
  • Indexing

Skill Sets

Languages: Perl CGI, object-oriented Perl, shell scripting, XML, PHP

Databases: MySQL and Oracle

Our Capabilities

Our primary expertise is in delivering web services, for which we use Perl, MySQL and XML. However, we also have the capability to execute projects in PHP, Oracle, shell scripting, C++, C and Java, and on Unix/Linux.

Perl CGI refers to Perl scripts that a web server runs through the Common Gateway Interface (CGI). Such scripts can read from databases and other data sources to generate web pages on the fly. Tricom maintains an excellent skill set for executing projects of this kind.

Tricom can work on LAMP projects, develop web application projects, develop sites in Perl CGI and provide many types of CGI implementations, including script customization, custom programming and script installation.

  • Perl script customization and custom programming: Tricom’s programmers can customize most Perl scripts to client requirements. We can modify most scripts written by other programmers or companies if their copyright terms allow it. We can also create custom scripts.
  • Perl script installation: Tricom provides a Perl script installation service. The client selects the script and we perform the installation. We can also install Perl scripts on Unix systems.