Looter: Web-Scraping for Humans!


A super-lightweight crawler tool.

  • automatically generate a spider from a template
  • blazing-fast crawling with concurrent.futures or asyncio
  • provides a shell to debug your spider
  • easy web content extraction with parsel (the same selector library Scrapy uses)
  • comes with some useful built-in functions
  • provides some examples to get you started

Installation

$ pip install looter

Only Python 3.6 is supported.

Quick start

Here’s a very simple image crawler. First, open the target page with looter’s shell:

$ looter shell https://konachan.com/post

Then you can extract the URLs of all the images, using the parsel Selector named tree that the shell provides:

>>> imgs = tree.css('a.directlink::attr(href)').extract()

Save these URLs to ‘konachan.txt’:

>>> Path('konachan.txt').write_text('\n'.join(imgs))

Finally, use wget to download these images to your local disk :)

$ wget -i konachan.txt
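
The same flow also works outside the shell. Below is a minimal standalone sketch using lt.fetch, assuming it returns the same parsel Selector that the shell exposes as tree:

from pathlib import Path

import looter as lt

# Fetch the page and parse it into a parsel Selector
tree = lt.fetch('https://konachan.com/post')

# Extract the direct image links, exactly as in the shell session above
imgs = tree.css('a.directlink::attr(href)').extract()

# Save the URLs, one per line, ready for `wget -i konachan.txt`
Path('konachan.txt').write_text('\n'.join(imgs))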

Workflow

If you want to quickly write a spider, you can use looter to automatically generate one :)

$ looter genspider <name> [--async]

The --async option generates a spider that uses asyncio instead of a thread pool.
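
For example, to generate a thread-pool spider and an asyncio one (the spider name konachan here is just an illustration):

$ looter genspider konachan
$ looter genspider konachan --async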

In the generated template, you can customize the domain and the tasklist.

What is the tasklist? It is simply the list of pages you want to crawl.

You can use a list comprehension to build your own tasklist; using konachan.com as an example:

domain = 'https://konachan.com'
tasklist = [f'{domain}/post?page={i}' for i in range(1, 9777)]

Then customize your crawl function, which is the core of your spider.

import looter as lt
from pprint import pprint

def crawl(url):
    tree = lt.fetch(url)       # fetch the page and parse it into a parsel Selector
    items = tree.css('ul li')  # the list of things you want to crawl
    for item in items:
        data = {}
        # data[...] = item.css(...)
        pprint(data)

In most cases, the content you want to crawl is a list (a ul or ol tag in HTML), so you can select it as items.

Then just use a for loop to iterate over them, select the fields you want, and store them in a dict.
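
To actually run the spider, crawl can be mapped over the whole tasklist with a thread pool, as the feature list mentions. This is only a sketch of the threaded variant; the generated template may wire it up slightly differently:

from concurrent import futures

if __name__ == '__main__':
    # Run crawl() over every page in the tasklist with a pool of worker threads
    with futures.ThreadPoolExecutor(20) as executor:
        results = executor.map(crawl, tasklist)
        # Consume the iterator so any exception raised inside crawl() surfaces here
        list(results)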

Note: looter uses parsel to parse the HTML, just like Scrapy.

Before you finish this spider, you’d better debug your code using the shell provided by looter.

>>> items = tree.css('ul li')
>>> item = items[0]
>>> item.css('...')  # replace '...' with the selector for whatever you want to crawl
# Pay attention to the outputs!

After debugging, your spider is done. Very simple, isn’t it :)

There are many example spiders written by the author.

Functions

view

Before crawling a page, you’d better check whether it’s rendered properly:

>>> view(url)

save

Save what you crawled to a file, with support for sorting and duplicate removal:

>>> total = [...]
>>> save(total, sort_by='key', no_duplicate=True)

If you want your data converted to CSV, just set the file extension to csv; note that you need pandas installed for the conversion.
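
For instance, something like the following should produce a CSV file (assuming the output filename is passed as the name keyword; check your looter version’s save signature if it differs):

>>> save(total, name='konachan.csv', no_duplicate=True)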

Summary

  1. By sniffing the network traffic, check whether the website has its own API. If it does, use it! If not, go to the next step.
  2. Determine whether the website is static or dynamic (whether content is loaded with JS, whether login is required, etc.). The methods are observation, sniffing, and looter’s view function.
  3. If the website is static, use ‘looter genspider’ to generate a spider template, then use ‘looter shell’ to debug and finish your spider.
  4. If the website is dynamic, sniff first and try to get all the API links generated by AJAX; if there is no API, go to the next step.
  5. Some websites do not directly expose their AJAX API links. In this case, you need to construct the API link yourself according to its pattern.
  6. If that doesn’t work, you have to use requestium to render the JS and crawl the page.
  7. As for login, IP proxies, captchas, distributed crawling and so on, please work them out by yourself.
  8. If your crawler project is required to use Scrapy, you can copy looter’s parsing code into Scrapy painlessly (both use parsel, after all); see the sketch after this list.
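
As a rough illustration of step 8, a selector debugged in the looter shell can be dropped into a Scrapy callback unchanged, since Scrapy’s response.css() is also backed by parsel. The spider below is a hypothetical example, not part of looter:

import scrapy


class KonachanSpider(scrapy.Spider):
    # A hypothetical spider reusing the selector debugged in the looter shell
    name = 'konachan'
    start_urls = ['https://konachan.com/post']

    def parse(self, response):
        # response.css() is parsel-backed, so the selector carries over unchanged
        for link in response.css('a.directlink::attr(href)').extract():
            yield {'link': link}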

Once you’ve mastered all the steps above, you can crawl almost anything you want!