1.Preface
Nowadays, more and more web pages are rendered with JavaScript, so traditional fetch tools such as wget and curl, which do not execute scripts, cannot retrieve the final content.
An alternative solution is webkit, the open source browser engine used most famously in Apple’s Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.
This solution has these advantages:
- It is a real browser engine, so images, JavaScript, and other resources are loaded and executed automatically
- It can be driven from Python, my favorite language
2.Environment
a. webkit (QtWebkit)
WebKit is ported to Qt under the name QtWebKit, so installing Qt comes first:
- Download the archive
- Compiling
Check the configure log: the WebKit module should be reported as “yes”.
WebKit module ………. yes
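If you save the configure output to a file, a few lines of Python can do this check for you. This is only a sketch: the function name `webkit_enabled` and the sample text are made up for illustration, and the pattern assumes the line looks like the one shown above.

```python
import re

def webkit_enabled(configure_output):
    """Return True/False for a "WebKit module ... yes/no" line, or None if absent."""
    # \W* skips the spaces and dots between the module name and the answer.
    match = re.search(r"WebKit module\W*(yes|no)\b", configure_output)
    if match is None:
        return None
    return match.group(1) == "yes"

# Hypothetical excerpt of a configure log, for illustration only.
sample = "QtDBus module ....... no\nWebKit module .......... yes\n"
print(webkit_enabled(sample))
```

A result of `None` means the log has no WebKit line at all, which is worth telling apart from an explicit "no".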
b. PyQt
PyQt is a set of Python v2 and v3 bindings for Digia’s Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux. PyQt5 supports Qt v5. PyQt4 supports Qt v4 and will build against Qt v5. The bindings are implemented as a set of Python modules and contain over 620 classes.
ps: another option is PySide
- Sip install
- PyQt4 install
ps1: if your sip install works, you can check the contents of “/your/python/env/share/sip/PyQt4”.
ps2: if the Python package installed correctly, you can see it in “env/lib/python2.7/site-packages/PyQt4”.
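Besides looking at the directories, you can ask Python directly whether the bindings import. A minimal sketch, where `check_modules` is a helper name I made up:

```python
import importlib

def check_modules(names):
    """Try to import each module and report which ones are available."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results

if __name__ == "__main__":
    # The modules this setup needs: sip plus the PyQt4 parts used for rendering.
    wanted = ["sip", "PyQt4.QtCore", "PyQt4.QtGui", "PyQt4.QtWebKit"]
    for name, ok in sorted(check_modules(wanted).items()):
        print("%s: %s" % (name, "OK" if ok else "MISSING"))
```

If `PyQt4.QtCore` imports but `PyQt4.QtWebKit` does not, Qt was most likely built without the WebKit module.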
c. Xvfb
If you access your server over ssh, without a graphical interface, Xvfb is one option.
Without an X server, the error log looks like this:
“xxxxxx cannot connect to X server”
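One way to handle this from Python is to detect the missing display and re-run the script under `xvfb-run`, the wrapper that usually ships with Xvfb. A sketch under assumptions: the helper names and the script name `render.py` are hypothetical, and the screen size is just a reasonable default.

```python
import os
import sys

def needs_virtual_display(environ=None):
    """True when no X display is configured (typical for a plain ssh session)."""
    environ = os.environ if environ is None else environ
    return not environ.get("DISPLAY")

def xvfb_command(script, screen="1024x768x24"):
    """Build an xvfb-run command line that runs `script` on a virtual display."""
    return [
        "xvfb-run",
        "--auto-servernum",                   # pick a free display number
        "--server-args=-screen 0 " + screen,  # virtual screen geometry and depth
        sys.executable,
        script,
    ]

# Usage (not executed here):
#   import subprocess
#   if needs_virtual_display():
#       subprocess.call(xvfb_command("render.py"))
```

Building the command as a list keeps the `-screen` argument intact without any shell quoting.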
d. TTF and Chinese
Finally, QtWebKit works, but you may hit a new problem: Chinese characters in WebKit are rendered as [][][][].
The root of the problem is missing TrueType (TTF) fonts; installing them fixes it.
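The exact steps depend on your system, but one common approach is to copy a TrueType font that covers Chinese into a fontconfig directory and rebuild the font cache. The sketch below assumes fontconfig is in use, that `~/.fonts` is scanned on your machine, and that you already have a suitable `.ttf` file; `install_font` is a name I made up.

```python
import os
import shutil
import subprocess

def install_font(ttf_path, font_dir=None, refresh=True):
    """Copy a .ttf file into a font directory and optionally refresh the cache."""
    # ~/.fonts is an assumption; some systems use ~/.local/share/fonts instead.
    font_dir = font_dir or os.path.expanduser("~/.fonts")
    if not os.path.isdir(font_dir):
        os.makedirs(font_dir)
    target = os.path.join(font_dir, os.path.basename(ttf_path))
    shutil.copy(ttf_path, target)
    if refresh:
        # fc-cache is fontconfig's cache builder; -f forces a rebuild.
        subprocess.call(["fc-cache", "-f", font_dir])
    return target

# Usage (the font path is a placeholder, not a real file):
#   install_font("/path/to/some-chinese-font.ttf")
```

After the cache is rebuilt, restart the rendering script so Qt picks up the new font.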
3.One Example
Here is a simple class that renders a webpage (including executing any JavaScript) and then saves the final HTML to a file:
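A minimal sketch of such a class with PyQt4 follows. The names `Render` and `save_html` are my own, and the PyQt4 imports are done inside the constructor only so that the file still loads on a machine without PyQt4.

```python
import io
import sys

text_type = type(u"")  # unicode on Python 2, str on Python 3

class Render(object):
    """Load a URL in QtWebKit, let its JavaScript run, and keep the final HTML."""

    def __init__(self, url):
        # Imported lazily so this module also loads where PyQt4 is absent.
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication
        from PyQt4.QtWebKit import QWebPage

        self.html = None
        self.app = QApplication(sys.argv)
        self.page = QWebPage()
        self.page.loadFinished.connect(self._load_finished)
        self.page.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _load_finished calls quit()

    def _load_finished(self, ok):
        # toHtml() returns the DOM as it stands after the page has loaded.
        self.html = text_type(self.page.mainFrame().toHtml())
        self.app.quit()

def save_html(html, path):
    """Write the rendered HTML to `path` as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text_type(html))

# Usage (needs PyQt4 and a display, real or Xvfb; the URL is a placeholder):
#   r = Render("http://example.com")
#   save_html(r.html, "page.html")
```

Note that `loadFinished` fires when the page load completes, so content that JavaScript fetches later on a timer may still be missing; in that case you would need to wait a little longer before reading the HTML.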