fabpot/goutte

Goutte (pronounced 'goot', French for 'drop') is a popular PHP web scraping and crawling library. Developed by Fabien Potencier, the creator of the Symfony framework, Goutte provides a simple and elegant API for programmatically interacting with websites. It is built upon several robust Symfony components, including:

* Symfony BrowserKit: Simulates a web browser, allowing you to make requests, click links, and submit forms without a real browser.
* Symfony DomCrawler: Provides an easy way to navigate HTML and XML documents using CSS selectors or XPath expressions, making data extraction straightforward.
* Symfony HttpClient (Goutte 4 and later; Goutte 3 and earlier used Guzzle): Handles the underlying HTTP requests, managing connections, headers, and responses.
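
To make the DomCrawler part concrete, here is a minimal sketch that runs the component on its own, without any HTTP request. The HTML snippet and variable names are made up for illustration; it assumes the Symfony DomCrawler and CssSelector packages are available through Composer, as they are when Goutte is installed.

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a downloaded page.
$html = <<<'HTML'
<html><body>
  <ul id="langs">
    <li class="lang">PHP</li>
    <li class="lang">Python</li>
    <li>Not a language</li>
  </ul>
</body></html>
HTML;

$crawler = new Crawler($html);

// Select elements with a CSS selector (relies on symfony/css-selector).
$viaCss = $crawler->filter('#langs li.lang')->each(function (Crawler $node) {
    return $node->text();
});

// Select the same elements with an XPath expression.
$viaXPath = $crawler->filterXPath('//ul[@id="langs"]/li[@class="lang"]')->each(function (Crawler $node) {
    return $node->text();
});

print_r($viaCss);
print_r($viaXPath);
```

Both calls return a new `Crawler` limited to the matching nodes, so the choice between CSS and XPath is mostly a matter of which syntax expresses the query more clearly.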

Key Features and How it Works:

1. Simulated Browser Interaction: Goutte simulates a browser at the HTTP level; it does not render pages or execute JavaScript. You create an instance of `Goutte\Client` and call its `request()` method to send `GET`, `POST`, or other requests to URLs.
2. Powerful Data Extraction: `request()` returns a `Crawler` object, which is the heart of data extraction. You can use its `filter()` method with CSS selectors (similar to jQuery) or its `filterXPath()` method with XPath expressions to target specific HTML elements.
3. Navigation and Form Submission: Beyond simple data retrieval, Goutte can simulate user actions. You can find links (`filter('a')->link()`) and click them (`$client->click($link)`), or find forms (`filter('form')->form()`), fill in fields, and submit them (`$client->submit($form)`).
4. Integration: Being built on Symfony components, Goutte integrates well within Symfony applications but is equally effective as a standalone library for any PHP project.
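5. Error Handling: `Crawler` methods such as `text()` and `attr()` throw an `\InvalidArgumentException` when the node list is empty, and network failures surface as exceptions from the underlying HTTP client, so both cases can be handled with ordinary `try`/`catch` blocks.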
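
The link and form handling described in point 3 can be sketched offline by building a `Crawler` from a string. The markup, the `https://example.com/` base URI, and the field name `q` below are hypothetical; with a live client, the resulting objects would be passed to `$client->click($link)` and `$client->submit($form)`.

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page. The second constructor argument is the page URI,
// which is used to resolve relative URLs in links and forms.
$html = <<<'HTML'
<html><body>
  <a href="/docs">Documentation</a>
  <form action="/search" method="get">
    <input type="text" name="q" value="">
    <input type="submit" value="Search">
  </form>
</body></html>
HTML;

$crawler = new Crawler($html, 'https://example.com/');

// Find a link by its text and resolve its absolute URI.
$link = $crawler->selectLink('Documentation')->link();
$linkUri = $link->getUri();

// Find the form, fill in a field via array access, and inspect the request
// it would produce. For GET forms the values end up in the query string.
$form = $crawler->filter('form')->form();
$form['q'] = 'goutte';
$formMethod = $form->getMethod();
$formUri = $form->getUri();
$formValues = $form->getValues();

echo $linkUri . "\n";
echo $formMethod . ' ' . $formUri . "\n";
```

Inspecting `getUri()` and `getValues()` before submitting is a cheap way to check that a form has been filled in as intended.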

Goutte simplifies complex web scraping tasks by abstracting away the intricacies of HTTP requests and DOM parsing, providing a clean and intuitive interface for developers.

Example Code

```php
<?php

require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

// --- Installation via Composer ---
// If you haven't already, install Goutte using Composer:
// composer require fabpot/goutte
// This will also install its dependencies like Symfony components.

// Create a new Goutte client instance
// In Goutte 4, the HTTP client argument is optional: when it is omitted,
// the client creates a Symfony HttpClient itself (Goutte 3 and earlier used Guzzle).
// Passing one explicitly makes it easy to configure timeouts, headers, and so on:
$httpClient = HttpClient::create();
$client = new Client($httpClient);

// --- Example 1: Basic Web Scraping (GET Request) ---

// Make a GET request to a sample Wikipedia page
// The request method returns a Crawler object
$crawler = $client->request('GET', 'https://en.wikipedia.org/wiki/Web_scraping');

// Extract the main title using a CSS selector
// filter() returns another Crawler object, then text() gets the content
try {
    $pageTitle = $crawler->filter('h1#firstHeading')->text();
    echo "Page Title: " . $pageTitle . "\n\n";
} catch (\InvalidArgumentException $e) {
    echo "Could not find the page title.\n\n";
}

// Extract the text of the first <p> that is a direct child of the main content area.
// eq(0) selects the first match; on some pages that paragraph may be empty,
// in which case a different index is needed.
// Note: CSS selectors can be fragile and might need adjustment based on page changes.
try {
    $firstParagraph = $crawler->filter('#mw-content-text > div.mw-parser-output > p')->eq(0)->text();
    // mb_substr() avoids cutting a multibyte UTF-8 character in half
    echo "First Paragraph Snippet: " . mb_substr($firstParagraph, 0, 200) . "...\n\n";
} catch (\InvalidArgumentException $e) {
    echo "Could not find the first paragraph.\n\n";
}

// --- Example 2: Extracting a Specific Attribute ---

// Let's try to get the 'src' attribute of the first image in the content
try {
    $firstImageSrc = $crawler->filter('#mw-content-text img')->eq(0)->attr('src');
    // Wikipedia often uses protocol-relative URLs, so we might need to prepend 'https:'
    if (strpos($firstImageSrc, '//') === 0) {
        $firstImageSrc = 'https:' . $firstImageSrc;
    }
    echo "First Image Source: " . $firstImageSrc . "\n\n";
} catch (\InvalidArgumentException $e) {
    echo "Could not find the first image or its 'src' attribute.\n\n";
}

// --- Example 3: Iterating through elements ---

echo "Some internal links on the page:\n";
$crawler->filter('a[href^="/wiki/"]')->each(function ($node, $i) {
    if ($i < 5) { // Limit to first 5 links for brevity
        $linkText = trim($node->text());
        $linkHref = $node->attr('href');
        if (!empty($linkText) && !empty($linkHref)) {
            echo "- " . $linkText . " (Link: " . $linkHref . ")\n";
        }
    }
});

echo "\nScraping completed.\n";

```
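
The `try`/`catch` blocks in the example are one way to cope with missing elements. As a hedged alternative, the sketch below checks `count()` before reading a node, or passes a default value to `text()`. The markup and the `(no subtitle)` placeholder are made up for illustration, and the default-value argument assumes a reasonably recent DomCrawler version.

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page that has an <h1> but no <h2>.
$crawler = new Crawler('<html><body><h1>Title</h1></body></html>');

// Baseline: text() on an empty node list throws.
$threw = false;
try {
    $crawler->filter('h2')->text();
} catch (\InvalidArgumentException $e) {
    $threw = true;
}

// Option 1: check whether anything matched before reading it.
$nodes = $crawler->filter('h2');
$subtitle = $nodes->count() > 0 ? $nodes->text() : null;

// Option 2: let text() return a default when nothing matched.
$fallback = $crawler->filter('h2')->text('(no subtitle)');

// Elements that do exist behave the same in every variant.
$title = $crawler->filter('h1')->text('(no title)');

var_dump($threw, $subtitle, $fallback, $title);
```

Which style fits best depends on whether a missing element is an expected case (a default or a `count()` check reads more naturally) or a real error (an exception is more appropriate).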
