電腦領域 HKEPC Hardware - Powered by Discuz! Board

Title: Retrieving data from AJAX generated website

Author: ronstudio Time: 2015-3-31 21:11 Title: Retrieving data from AJAX generated website

May I ask if anyone has tried to retrieve data from a website whose content is generated by AJAX? What kind of tools do you use?

Thanks in advance~

Author: Jackass_TMxCK Time: 2015-3-31 22:11

Have you heard of the "Same Origin Policy"?

Once you get around this restriction, it is no different from a normal website fed by a RESTful API.

Author: ronstudio Time: 2015-4-1 08:18

Reply to 2# Jackass_TMxCK

Thanks for the help! I'll do further research in this area!

Author: Jackass_TMxCK Time: 2015-4-1 08:57

Reply to Jackass_TMxCK

Thanks for the help! I'll do further research in this area!
ronstudio posted on 1/4/2015 08:18 AM

I thought of something: server-side cURL is not subject to the same-origin policy.

Check out PHP simple proxy
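
For anyone who wants to see the idea behind a server-side proxy without PHP, here is a minimal Python sketch using only the standard library. It is not the PHP Simple Proxy code, just an illustration of the same principle: the fetch happens on the server, where no browser is present to apply a same-origin check. The names `fetch_remote` and `make_proxy_app` are made up for this example, and the `fetch` parameter exists only so the relay logic can be exercised without network access.

```python
import urllib.request


def fetch_remote(url):
    """Fetch a URL from the server side and return the raw body bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def make_proxy_app(target_url, fetch=fetch_remote):
    """Build a WSGI app that relays the body of target_url to the caller.

    target_url is fixed when the app is created, so callers cannot use
    the proxy to reach arbitrary hosts.
    """
    def app(environ, start_response):
        try:
            body = fetch(target_url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            body = str(exc).encode("utf-8")
            start_response("502 Bad Gateway", [("Content-Type", "text/plain")])
            return [body]
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

To try it locally, the app could be served with `wsgiref.simple_server.make_server("127.0.0.1", 8000, make_proxy_app(url)).serve_forever()`, where `url` is whatever data URL you need. The page's own JavaScript would then request the local address instead of the remote one.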

Author: ronstudio Time: 2015-4-1 17:24

Thanks. So far I have noticed that most solutions people use load some kind of browser module to simulate web browsing and extract the data. In Perl, at least, there is a module that drives Firefox to achieve this.

This is doable, but it takes more effort and needs further study.

Author: chi251155 Time: 2015-4-1 19:32

Reply to 1# ronstudio

curl or wget

Author: ronstudio Time: 2015-4-4 23:48

The problem with plain wget is that the page content is generated by AJAX JavaScript. So when I use plain wget, I get the HTML structure but none of the content that I need to parse.

Author: toylet Time: 2015-4-5 00:02

Notice: The author has been banned or deleted; content automatically hidden

Author: chi251155 Time: 2015-4-5 00:06

Reply to 7# ronstudio

Use wget or curl to fetch the AJAX response itself. You know AJAX is a method of communication, right? You can easily find the URL of the data source, and that is easier than crawling the web page, since the content is usually formatted as JSON or XML.
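
As a rough illustration of fetching the data source directly, here is a small Python sketch using only the standard library. It assumes the data URL returns a JSON array of objects. The function names and the field names in the usage note are hypothetical, since the thread does not say what the real response looks like.

```python
import json
import urllib.request


def fetch_json(url):
    """GET the data URL the page's JavaScript would call and decode the JSON."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_rows(payload, fields):
    """Pick the wanted fields out of each record in a list-of-dicts payload.

    Missing fields come back as None instead of raising KeyError.
    """
    return [tuple(item.get(f) for f in fields) for item in payload]
```

Typical use would be `extract_rows(fetch_json(data_url), ["name", "price"])`, with `data_url` taken from the browser's network monitor. If the source returns XML instead, the same approach applies with an XML parser in place of `json.loads`.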

Author: Jackass_TMxCK Time: 2015-4-5 00:16

Reply to ronstudio

use wget and curl to get the message of ajax. you know ajax is a method of co ...
chi251155 posted on 5/4/2015 12:06 AM

Same Origin Policy....

Author: ronstudio Time: 2015-4-5 01:16

Thanks everyone for the reply... I'm basically clueless on this now.

For example, say I want to retrieve the table from the Peach airline site at the following link:
http://book.flypeach.com/default ... n-US&ao=B2CENUS

I'm now using Perl with WWW::Mechanize::Firefox to drive Firefox and fetch the page. But the problem is that the content of "View source" differs between Firefox and Chrome, from what I have seen so far. In Chrome, I can see the table content with the price info.

But the price info table does not show up in Firefox's "View source". Once I can get Firefox to show it, I can have the Perl module fetch the page only after a certain element has loaded. But I can't even see that element in Firefox.
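
One possible explanation, which the thread does not confirm, is that "View source" shows the HTML as the server first sent it, while the price table is added to the live DOM later by JavaScript. Under that assumption, the "wait until a certain element has loaded" step can be sketched in Python as a simple polling loop. Here `get_html` is a hypothetical callable standing in for whatever your browser-driving module offers to read the current rendered DOM as a string.

```python
import time


def wait_for_marker(get_html, marker, timeout=10.0, interval=0.5):
    """Poll get_html() until marker appears in the result or timeout expires.

    get_html: hypothetical callable returning the current rendered DOM text.
    marker:   a substring identifying the element, e.g. an id attribute.
    Returns the HTML containing the marker, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A substring check is crude. With a real driver you would more likely test for the element through its own selector API, but the polling structure stays the same.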

Author: chi251155 Time: 2015-4-6 13:10

Reply to 10# Jackass_TMxCK

The same-origin policy is enforced by the browser, so it has nothing to do with this case.

Author: ronstudio Time: 2015-4-6 14:18

Thanks everyone~ I finally solved the problem by controlling Firefox via Perl, so I am finally able to capture the AJAX-generated page.
The first difficulty was that the page uses POST, so I had no explicit web link to wget. I then used Live HTTP Headers to check what was going on behind the scenes, and manually reproduced the same HTTP headers to trigger the response from the server.
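
For readers who want to replay a captured POST outside the browser, here is a minimal Python sketch using only the standard library. This is not the Perl code used above. The function names are made up for illustration, and the form fields and header values are whatever your own capture shows.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form_fields, headers):
    """Build a POST request carrying the form fields and headers from a capture."""
    # urlencode percent-escapes non-ASCII characters, so the result is pure ASCII.
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    for name, value in headers.items():
        req.add_header(name, value)
    return req


def send(req):
    """Send the prepared request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Typical use would be `send(build_post_request(url, fields, {"User-Agent": ..., "Referer": ...}))`, copying the values from the captured request. Some sites also expect cookies from an earlier page load, in which case those need to be replayed as well.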

Author: kazenorin Time: 2015-4-7 21:28

node.js has a project called contextify (https://github.com/brianmcd/contextify),
which might do roughly what you want?

Although you have already found another approach.