Has anyone scraped Google's search results?

This topic was created 3429 days ago; the information in it may have since evolved or changed.

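I know this is an odd thing to do, but my advisor wants it done.

I tried both curl and scrapy shell and neither shows the result content. Do I really need to simulate a browser with Selenium?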

Selenium

Scrapy

curl

19 replies • 2015-06-13 18:06:28 +08:00

lianyue

2015-06-13 11:56:09 +08:00

Google rate-limits you: scrape a little too much and you get a CAPTCHA. As I recall, Google has an AJAX search API that returns JSON; look it up yourself.

lincanbin

2015-06-13 11:57:21 +08:00

Andy1999

2015-06-13 12:01:12 +08:00 via iPhone

You will get rate-limited.

shierji

2015-06-13 12:05:08 +08:00

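@lianyue
@lincanbin
@Andy1999 What I mean is that I can't get any results even on the very first request.
The problem of making many requests can surely be solved with proxies.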

dong3580

2015-06-13 12:05:29 +08:00 via Android

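Try embedding a WebBrowser control; that should work. C# has one provided by Microsoft; I'm not sure about other languages. Google Search runs a lot of detection, and you get blocked after a few attempts. As for the AJAX API mentioned in the first reply, it returns 4 results per page, with 18 to 20 pages of results in total; requests beyond that fail.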

ericls

2015-06-13 12:07:38 +08:00

As a roundabout alternative, try aol.com

binux

2015-06-13 12:12:43 +08:00

Don't scrape the desktop version.

elgoog

2015-06-13 12:36:40 +08:00

Won't the API do?

gdwest

2015-06-13 12:39:28 +08:00 via iPhone

This is a question you should @ the major domestic search engines about.

zhjits

2015-06-13 12:52:32 +08:00

Pricing

JSON/Atom Custom Search API pricing and quotas depend on the engine's edition:

Custom Search Engine (free)
For CSE users, the API provides 100 search queries per day for free. If you need more, you may sign up for billing in the Developers Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.

Google Site Search (paid).
For detailed information on GSS usage limits and quotas, please check GSS pricing options.
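For a rough sense of what the quoted pricing means in practice, here is a minimal sketch of a daily cost estimate. The function name `daily_cost_usd` is hypothetical, and two points are assumptions the quoted text does not settle: that the 10k daily ceiling includes the 100 free queries, and that billing is pro rata per query rather than per started block of 1000.

```python
FREE_QUERIES_PER_DAY = 100    # free daily allowance from the quoted pricing
PRICE_PER_1000_USD = 5.0      # price of additional queries from the quoted pricing
MAX_QUERIES_PER_DAY = 10_000  # daily ceiling (assumed to include the free 100)


def daily_cost_usd(queries: int) -> float:
    """Estimate the cost in USD of one day of Custom Search API usage."""
    if queries < 0:
        raise ValueError("queries must be non-negative")
    if queries > MAX_QUERIES_PER_DAY:
        raise ValueError("the quoted pricing allows at most 10k queries per day")
    billable = max(0, queries - FREE_QUERIES_PER_DAY)
    # Assumption: pro rata billing, not rounded up to whole blocks of 1000.
    return billable * PRICE_PER_1000_USD / 1000
```

Under these assumptions, `daily_cost_usd(1100)` gives 5.0 and a full day at the 10k ceiling comes to 49.5.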

icedx

2015-06-13 13:01:11 +08:00

Yes.

icedx

2015-06-13 13:15:51 +08:00

V2EX doesn't support indentation, so paste the two lines of code below into Python and you'll see the solution.

Code="""import[Space]requests\n\nConf_UseProxy=0\n\nHeaders={'User-Agent':'Mozilla/4.0[Space](Windows;[Space]MSIE[Space]6.0;[Space]Windows[Space]NT[Space]5.2)'}\n\nif[Space]Conf_UseProxy==1:\n[Space]import[Space]socks[Space]\n[Space]import[Space]socket[Space]\n[Space]socks.set_default_proxy(socks.SOCKS5,'localhost',1079,rdns=True)[Space]\n[Space]socket.socket[Space]=[Space]socks.socksocket\n\ndef[Space]GetGoogle(KeyWord):\n[Space]Url='https://www.google.com/search?q='+KeyWord\n[Space]Response=requests.get(Url,headers=Headers)\n[Space]print(Response.content)\n\nGetGoogle('QueenSamaprpr')"""
print(Code.replace('\\n','\n').replace('[Space]',' '))
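Decoded, the two lines above print a short script that sends a GET request to `https://www.google.com/search?q=...` with an old MSIE 6.0 `User-Agent` header. A rough Python 3 sketch of the same idea follows, with some deliberate differences: it uses the standard library's `urllib` instead of `requests`, it URL-encodes the keyword (the original concatenates it directly), and it leaves out the optional SOCKS5 proxy. The names `build_search_request` and `fetch_search_html` are hypothetical, and whether Google still serves results to this User-Agent is not something this sketch can promise.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same User-Agent string as in the snippet above (an old IE 6 identifier).
HEADERS = {
    "User-Agent": "Mozilla/4.0 (Windows; MSIE 6.0; Windows NT 5.2)",
}


def build_search_request(keyword: str) -> Request:
    """Build (but do not send) a GET request for a Google search."""
    url = "https://www.google.com/search?" + urlencode({"q": keyword})
    return Request(url, headers=HEADERS)


def fetch_search_html(keyword: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body."""
    with urlopen(build_search_request(keyword), timeout=timeout) as response:
        return response.read()
```

Calling `fetch_search_html('QueenSamaprpr')` would perform the actual request; `build_search_request` can be inspected without touching the network.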

shierji

2015-06-13 13:22:49 +08:00

@icedx Use a gist - -

This... code with no indentation is killing me - -

shierji

2015-06-13 13:23:19 +08:00

@zhjits I've looked at that. It's painfully expensive.

icedx

2015-06-13 13:25:59 +08:00

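This part is not for the OP: don't mock my code style. Those of you who didn't solve the problem have no right to mock my code style! (points)

This part is for the OP: the pagination parameter is &start=(target page number - 1)*10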
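The pagination formula above can be sketched as a small URL builder. This is only an illustration: the function names `page_to_start` and `search_url` are hypothetical, the 10 results per page is implied by the formula rather than stated, and the keyword is URL-encoded here for safety.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # implied by the formula start = (page - 1) * 10


def page_to_start(page: int) -> int:
    """Convert a 1-based page number to the value of the `start` parameter."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * RESULTS_PER_PAGE


def search_url(keyword: str, page: int = 1) -> str:
    """Build the search URL for a given keyword and 1-based page number."""
    params = {"q": keyword}
    start = page_to_start(page)
    if start:
        # The first page needs no `start` parameter.
        params["start"] = start
    return "https://www.google.com/search?" + urlencode(params)
```

For example, `search_url("scrapy", 3)` yields `https://www.google.com/search?q=scrapy&start=20`.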

icedx

2015-06-13 13:29:01 +08:00

@shierji It's only two lines with no indentation; paste them into Python and you'll see the real code...

https://gist.github.com/anonymous/d559b998e47afedd668b

shierji

2015-06-13 16:47:28 +08:00

@icedx Uh, so your approach is to force an IE User-Agent - -
That's much the same idea as binux's.

shierji

2015-06-13 17:00:07 +08:00

@icedx
I'm heading out to eat first. I'll try it when I get back.

icedx

2015-06-13 18:06:28 +08:00 via Android

@shierji His method has a problem.
If you use the mobile version,
Google will return the mobile URLs for many sites.