This list contains PHP libraries related to web scraping and data processing.
- PHP Web Scraping
- Network
- Web-scraping Frameworks
- HTML/XML Parsing
- Text processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- Cloud Computing
- URL Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Geocoding
- API Clients
- Other PHP Lists
- Guzzle - A comprehensive HTTP client.
- Buzz - Another HTTP client.
- Requests - A simple HTTP library.
- HTTPFul - A chainable HTTP client.
- Goutte - A simple web scraper.
- PHP Spider - A comprehensive web spider.
- Crawler - (crwlr) - A library for rapid web crawler and scraper development.
- Roach - A port of the popular Scrapy package for Python. Includes adapters for Laravel and Symfony.
- HTML5 PHP - An HTML5 parser and serializer library.
- QueryPath - A jQuery-like library for working with XML and HTML documents in PHP. It now supports HTML5 via the HTML5-PHP project.
- DiDOM - A super fast HTML parser (because it was built on top of plain PHP).
- PHPScraper - A highly opinionated interface for scraping the web.
- DomCrawler - (Symfony) - The DomCrawler component eases DOM navigation for HTML and XML documents.
Libraries for parsing and manipulating plain text.
- General
- ANSI to HTML5 - An ANSI to HTML5 converter library.
- Patchwork UTF-8 - A portable library for working with UTF-8 strings.
- Hoa String - Another UTF-8 string library.
- Stringy - A string manipulation library with multibyte support.
- Color Jizz - A library for manipulating and converting colours.
- Text - A text manipulation library.
- Flux - A regular expression building library.
- Transliteration
- User-agent
- CrawlerDetect - A PHP class for detecting bots, crawlers, and spiders via the user agent and http_from header.
- PHPUserAgent - A simple, streamlined PHP user-agent parser.
- AgentZero - A fast library for extracting information from User-Agent strings.
- Device Detector - Another library for parsing user agent strings.
- Mobile-Detect - A lightweight PHP class for detecting mobile devices (including tablets).
- UA Parser - A library for parsing user agent strings.
- Units of measure
- ByteUnits - A library to parse, format and convert byte units in binary and metric systems.
- PHP Units of Measure - A library for converting between units of measure.
- PHP Conversion - Another library for converting between units of measure.
- Phone number
- LibPhoneNumber for PHP - A PHP implementation of Google's phone number handling library.
Libraries for parsing and manipulating specific text formats.
- CSV
- CSV - A CSV data manipulation library.
- Office
- PHPWord - A library for working with Microsoft Word documents.
- PHPExcel - A library for working with Microsoft Excel documents.
- PHPPowerPoint - A library for working with Microsoft PowerPoint documents.
- ExcelAnt - A library for manipulating Microsoft Excel documents.
- Markdown
- PHP Markdown - A Markdown parser.
- CommonMark PHP - A Markdown parser which supports the full CommonMark spec.
- Parsedown - Another Markdown parser.
- Ciconia - Another Markdown parser that supports GitHub Flavored Markdown.
- Cebe Markdown - A fast and extensible Markdown parser.
- BBCode
- Decoda - A lightweight lexical string parser for BBCode styled markup.
- JSON
- JsonMapper - A library that maps nested JSON structures onto PHP classes.
- vCard
- vobject - A library for easily parsing and manipulating iCalendar and vCard objects.
- File Type Detection
- Hoa Mime - A MIME detection library.
- Canal - A library to determine internet media types.
- Apache MIME Types - A library that parses Apache MIME types.
- GeoJSON
- GeoJSON - A GeoJSON implementation.
Libraries for working with human languages.
- PHP NlpTools - Natural language processing tools in PHP.
- nlpTools - A natural language processing toolkit for PHP.
- php-webdriver - A PHP client for WebDriver.
- PHP PhantomJS - Executes PhantomJS commands through PHP.
- Mink - A universal API for multiple browser emulators (Selenium, Zombie.js, Goutte).
- Spork - A process forking library.
Libraries for asynchronous network programming.
- React - An event driven non-blocking I/O library.
- Rx.PHP - A reactive extension library.
- Hoa EventSource - An event source library.
- Evenement - An event dispatcher library.
- Event - An event library with a focus on domain events.
- Broadway - An event source and CQRS library.
- Pheanstalk - A Beanstalkd client library.
- PHP AMQP - A pure PHP AMQP library.
- Thumper - A RabbitMQ pattern library.
- Bernard - A multibackend abstraction library.
- TODO
Libraries for parsing email.
- Email Reply Parser - An email reply parser library.
- Email Validator - A small email address validation library.
Libraries for parsing URLs.
- Purl - A URL manipulation library.
- PHP Domain Parser - A domain suffix parser library.
- Uri (The PHP League) - A simple URL manipulation library (PSR-7 compatible).
- Url (crwlr) - A Swiss Army knife for URLs.
- Text and Meta Data from Web Documents
- Video
- Youtube-Downloader - A PHP script for downloading videos from YouTube; it also parses YouTube feeds into RSS enclosures for podcatchers.
Libraries for working with WebSocket.
- Ratchet - A WebSocket library.
- Hoa WebSocket - Another WebSocket library.
- Elephant.io - Yet another WebSocket library.
- Net_DNS2 - A native PHP DNS resolver and updater.
- OpenCV-for-PHP - An OpenCV binding for PHP.