Developer(s) | Apify |
---|---|
Initial release | 13 July 2022 |
Written in | TypeScript, Python |
Operating system | Windows, macOS, Linux |
Type | Web crawler |
License | Apache License 2.0 |
Crawlee is a free and open-source web-crawling and browser automation library developed by Apify. The original TypeScript version was first released under the Crawlee name in 2022, with a Python version added in 2024.
Crawlee's architecture is built around modular crawlers responsible for extracting data from websites. The library follows a declarative programming approach, where users define crawling logic through a structured set of rules. Crawlee uses queues to manage requests; for each request, a specific function is executed to extract data or perform further processing.
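The queue-and-handler pattern described above can be illustrated with a minimal sketch in plain Python. This is not Crawlee's API: the names `crawl`, `handle_request`, `enqueue` and the in-memory `FAKE_SITE` are hypothetical and stand in for real HTTP fetching, so that the example stays self-contained.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "site": URL -> (title, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
FAKE_SITE: Dict[str, Tuple[str, List[str]]] = {
    "https://example.com/": ("Home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("Page A", ["https://example.com/b"]),
    "https://example.com/b": ("Page B", ["https://example.com/"]),
}


def crawl(start_url: str, handler: Callable[[str, Callable[[str], None]], dict]) -> List[dict]:
    """Take requests from a queue one by one and call `handler` for each."""
    queue = deque([start_url])
    seen = {start_url}  # deduplicate so that each URL is handled only once
    results: List[dict] = []

    def enqueue(url: str) -> None:
        # Add a newly discovered URL to the queue unless it was already seen.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    while queue:
        url = queue.popleft()
        results.append(handler(url, enqueue))
    return results


def handle_request(url: str, enqueue: Callable[[str], None]) -> dict:
    """Per-request function: extract data and enqueue the links it finds."""
    title, links = FAKE_SITE[url]
    for link in links:
        enqueue(link)
    return {"url": url, "title": title}


pages = crawl("https://example.com/", handle_request)
for page in pages:
    print(page["url"], page["title"])
```

In this sketch the user supplies only the per-request function, while the loop that owns the queue decides when each request runs and filters out duplicates.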
Crawlee supports both headless browser sessions (via Playwright and other browser automation software) and plain HTTP request-based scraping.
It also provides various web-scraping-related utilities, such as a sitemap parser[1] and an automatic HTTP proxy manager.
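As a rough illustration of what such utilities do, the following sketch uses only the Python standard library and does not reproduce Crawlee's own implementation. The function names `parse_sitemap` and `assign_proxies`, the sample sitemap, and the proxy URLs are hypothetical, and round-robin rotation is an assumed strategy chosen for simplicity.

```python
import itertools
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be downloaded from a website.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""


def parse_sitemap(xml_text: str) -> List[str]:
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def assign_proxies(urls: List[str], proxies: List[str]) -> List[Tuple[str, str]]:
    """Pair each URL with a proxy, cycling through the proxy list in order."""
    rotation = itertools.cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Placeholder proxy addresses (assumptions for the example only).
PROXIES = ["http://proxy-1.example:8000", "http://proxy-2.example:8000"]

urls = parse_sitemap(SAMPLE_SITEMAP)
plan = assign_proxies(urls, PROXIES)
for url, proxy in plan:
    print(url, "via", proxy)
```

A production proxy manager would typically also track failing proxies and retire them, which this sketch leaves out.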
Notable projects that use Crawlee include GPT Crawler by Builder.io[2] and various generative AI projects maintained by AWS Labs[3].
History
The first stable TypeScript version was released in 2021 under the name Apify SDK[4]. This version included both the open-source crawling framework and a proprietary storage implementation for use on the Apify platform.
In 2022, version 3.0.0 was released[5], renaming the library to Crawlee. This update made Crawlee independent of the Apify platform by moving most of the Apify-specific features into a separate package, which kept the name Apify SDK.
In 2024, a beta version of Crawlee for Python was released[6].
References
1. "Release v3.7.0 · apify/crawlee". GitHub. Retrieved 22 September 2024.
2. "BuilderIO/gpt-crawler: Crawl a site to generate knowledge files to create your own custom GPT from a URL". GitHub. Retrieved 21 September 2024.
3. "awslabs/generative-ai-cdk-constructs: AWS Generative AI CDK Constructs are sample implementations of AWS CDK for common generative AI patterns". GitHub. Amazon Web Services - Labs. 20 September 2024. Retrieved 21 September 2024.
4. "Release v1.0.0 · apify/crawlee". GitHub.
5. "Release v3.0.0 · apify/crawlee". GitHub.
6. "Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers | Crawlee · Build reliable crawlers. Fast". crawlee.dev. 5 July 2024.