
Indy 9 HTTP & Read Timeout


2004-01-21 05:39:35 AM
delphi24
I continue to have a problem that neither I nor others on the newsgroups
have been able to resolve. I have a Delphi 7 application that does a
little bit of web crawling. I use the TidHTTP component to download each
web page. Downloading is done within a loop statement and is supposed to go
until it runs out of URLs. After about 30 pages are downloaded I start
getting Read Timeout errors that never go away. The only way to continue is
to close the application and then re-open it. I have tried using keep-alive,
creating the HTTP component dynamically (freeing it and re-creating it after
each page request), and forcing a disconnect after each request. Nothing
has helped. It seems as though the HTTP component won't release the
connection so the server is preventing new connections.
Anyone have any ideas on this?
Thanks.
 
 

Re:Indy 9 HTTP & Read Timeout

"Greg" <XXXX@XXXXX.COM>wrote in
Quote
I've tried using keep-alive, creating the HTTP component dynamically
(freeing it and re-creating it after each page request), and forcing a
disconnect after each request. Nothing has helped. It seems as though
the HTTP component won't release the connection so the server is
preventing new connections.
Dynamically recreating it will flush all the data out. Do you have ZA or other
proxies installed? I have a web crawler that fetches tens of thousands of
pages, and it's based on Indy.
 

Re:Indy 9 HTTP & Read Timeout

Quote
Dynamicly recreating it will flush all the data out. Do you have ZA or other
proxies installed? I have a web crawler that gets tens of thousands and its
based on Indy.
Just a hardware firewall (router). This problem only occurs on some sites
(none of which I am affiliated with). I can provide a link to a problematic
one if you'd like.
 

Re:Indy 9 HTTP & Read Timeout

"Greg" <XXXX@XXXXX.COM>wrote in news:400da5a5$XXXX@XXXXX.COM:
Quote
Just a hardware firewall (router). This problem only occurs on some sites
(none of which I am affiliated with). I can provide a link to a problematic
one if you'd like.
It sounds like the sites have a DoS-attack preventer or something similar
and are just blocking you.
 

Re:Indy 9 HTTP & Read Timeout

Chad Z. Hower aka Kudzu writes:
Quote
It sounds like the sites have a DOS attack preventor or other and are
just blocking you.
That's what I suspect too. And there are proxies that prevent "crawling" of
web sites (to be honest, I would do that too).
--
Ben
 

Re:Indy 9 HTTP & Read Timeout

"Chad Z. Hower aka Kudzu" <XXXX@XXXXX.COM>writes
Quote
It sounds like the sites have a DOS attack preventor or other and are just
blocking you.
That doesn't appear to be the reason for one site. I have a server (Windows
2000) and sometimes it crawls it without a problem but other times the web
server needs to be restarted. This happens with a single thread requesting
a page about every second... sometimes two pages a second.
Any other ideas?
 

Re:Indy 9 HTTP & Read Timeout

"Greg" <XXXX@XXXXX.COM>writes
Quote
"Chad Z. Hower aka Kudzu" <XXXX@XXXXX.COM>writes
news:XXXX@XXXXX.COM...
>It sounds like the sites have a DOS attack preventor or other and are just
>blocking you.

That doesn't appear to be the reason for one site. I have a server (Windows
2000) and sometimes it crawls it without a problem but other times the web
server needs to be restarted. This happens with a single thread requesting
a page about every second... sometimes two pages a second.
Strange. It's definitely not TIdHTTP - I have an app running that polls six
web servers every second (intranet). It's been running for four months on
site with no 'incidents'.
It's strange that the *server* needs to be restarted. Do you restart the
server app or reboot the box? I agree with the other posters - sounds like
a dodgy proxy or even the server itself.
Failing that, can you post your HTTP thread code?
Rgds,
Martin
 

Re:Indy 9 HTTP & Read Timeout

"Martin James" <XXXX@XXXXX.COM>writes
Quote

Strange. It's definitely not TIdHTTP - I have an app running that polls six
web servers every second (intranet). It's been running for four months on
site with no 'incidents'.

It's strange that the *server* needs to be restarted. Do you restart the
server app or reboot the box? I agree with the other posters - sounds like
a dodgy proxy or even the server itself.

Failing that, can you post your HTTP thread code?

I can not post the entire code for the thread, but here's the basics of it
with all of the TidHTTP code:
var
  http : TidHTTP;
begin
  http := TidHTTP.Create(nil);
  http.HandleRedirects := TRUE;
  http.ReadTimeout := 10000; // tried increasing to more than 60 seconds but made no difference
  http.Request.BasicAuthentication := TRUE;
  // (loop start)
  ...
  strRobots := http.Get(strTempURL + 'robots.txt');
  ...
  strHTML := http.Get(strURL);
  ...
  http.DisconnectSocket; // added to try to fix the problem
  ...
  // (loop end)
  FreeAndNil(http);
end
The actual server needs restarting. A while back when I first posted about
the problem, one server kept showing new connections for every connection
made but they were never released and that killed that server. It happened
repeatedly.
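
For reference, the approach described in this post (a fresh component for each request, released in a finally block, with errors caught per page) might look like the minimal sketch below. It is an illustration only, not a confirmed fix: the thread reports that recreating the component did not cure the timeouts. CrawlUrls, the placeholder URL, and the 30-second timeout are hypothetical, and the sketch assumes the IdHTTP and IdException units as shipped with Indy 9.

```delphi
program CrawlSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP, IdException;

// Hypothetical helper: downloads every URL in Urls, using a fresh TIdHTTP
// instance per request so no connection state carries over between pages.
procedure CrawlUrls(Urls: TStrings);
var
  Http: TIdHTTP;
  Html: string;
  I: Integer;
begin
  for I := 0 to Urls.Count - 1 do
  begin
    Http := TIdHTTP.Create(nil);
    try
      Http.HandleRedirects := True;
      Http.ReadTimeout := 30000;          // milliseconds; example value only
      // Assumption: asking for a non-persistent connection is acceptable here.
      Http.Request.Connection := 'close';
      try
        Html := Http.Get(Urls[I]);
        Writeln(Urls[I], ': ', Length(Html), ' characters');
      except
        // EIdException is the base class of Indy's own exceptions, so
        // read timeouts and HTTP protocol errors are both caught here.
        on E: EIdException do
          Writeln(Urls[I], ' failed: ', E.Message);
      end;
    finally
      Http.Free; // always release the component, even after an error
    end;
  end;
end;

var
  Urls: TStringList;
begin
  Urls := TStringList.Create;
  try
    Urls.Add('http://www.example.com/'); // placeholder URL
    CrawlUrls(Urls);
  finally
    Urls.Free;
  end;
end.
```

Catching the exception inside the loop lets a single failed page be logged and skipped instead of ending the crawl. If every later request still times out, the cause is more likely outside the component.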
 

Re:Indy 9 HTTP & Read Timeout

"Greg" <XXXX@XXXXX.COM>writes
Quote
The actual server needs restarting. A while back when I first posted about
the problem, one server kept showing new connections for every connection
made but they were never released and that killed that server. It happened
repeatedly.
An example problematic site is dmoz.org/. Even using a single thread with
a 2 second pause between requests, it still gives me Read Timeout errors,
and that is after retrieving only 6 pages. If I close and re-open the app,
it works again until the Read Timeout errors return.
 

Re:Indy 9 HTTP & Read Timeout

"Greg" <XXXX@XXXXX.COM>writes
Quote
An example problematic site is dmoz.org/. Even with using a single
thread with a 2 second pause between requests it still gives me Read Timeout
errors and that is after retrieving only 6 pages. If I close and re-open the
app it works again until the Read Timeout errors happen.

Actually, disregard that. The dmoz problem was due to the fact that the
ReadTimeout value was set too low.
 

Re:Indy 9 HTTP & Read Timeout

"Chad Z. Hower aka Kudzu" <XXXX@XXXXX.COM>writes
Quote
It sounds like the sites have a DOS attack preventor or other and are just
blocking you.
I've been in contact with the system administrator of one of the sites and
it doesn't appear that DoS protection is the problem. What's happening is
that after about 3 minutes all pages return Read Timeout errors (regardless
of what the ReadTimeout value is set to). Once this happens, you can't even
view the web site through a web browser, so the web server needs to be
restarted. The web server runs Windows 2000 (not sure yet which edition,
although I know it's not Professional).
Strangely, this doesn't happen every time, but it does happen the majority
of the time.
 

Re:Indy 9 HTTP & Read Timeout

I'm having the same problem with a crawler for www.shoutcast.com ... although
it won't even get a single page .. I use this code
var
  HTTPResult : TStringStream;
try
  HTTP.Head(SearchURL);
  HTTP.Get(SearchURL, HTTPResult);
finally
  WriteToLog(HTTPResult.DataString); // for debugging sockets
  // custom procedure for splitting the returned data into tokens / passes the string stream
  ProcessSearchData(HTTPResult);
  HTTPResult.Free;
end;
SearchURL contains scastlb2.shoutcast.com/directory/ ...
basically the program is supposed to get shoutcast servers from the webpage
it downloads, but even outside of a loop the site gives a read timeout
instantly ... but the URL opens in IE just fine .. yes, I made sure the URL
was passed right by calling the WriteToLog procedure right after Get, which
passes the info to a memo ... not sure how to get this one working <shrug>
I've used TidHTTP a lot in the past and never had a problem with this site ..
maybe my code is not doing things the right way?


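One thing the snippet in this post does is log and process HTTPResult inside the finally block, so that code runs even when Get raises. A minimal sketch that creates the stream up front and reads it only after a successful Get is shown below. FetchPage, the placeholder URL, the timeout, and the User-Agent string are hypothetical. Whether the User-Agent plays any part in the timeout against this site is an assumption, not something the thread confirms.

```delphi
program FetchSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP, IdException;

// Hypothetical helper: returns the body at SearchURL, or '' if the request fails.
function FetchPage(const SearchURL: string): string;
var
  Http: TIdHTTP;
  HTTPResult: TStringStream;
begin
  Result := '';
  Http := nil;
  HTTPResult := nil;
  try
    Http := TIdHTTP.Create(nil);
    HTTPResult := TStringStream.Create(''); // create the stream before it is used
    Http.ReadTimeout := 30000;              // milliseconds; example value only
    // Assumption: the server may treat Indy's default User-Agent differently
    // from a browser's. The string below is only an example value.
    Http.Request.UserAgent := 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
    try
      Http.Get(SearchURL, HTTPResult);      // a single GET, no separate HEAD first
      Result := HTTPResult.DataString;      // read the data only after Get succeeds
    except
      // EIdException is the base class of Indy's own exceptions.
      on E: EIdException do
        Writeln('Request failed: ', E.Message);
    end;
  finally
    HTTPResult.Free; // Free is safe to call on nil
    Http.Free;
  end;
end;

begin
  Writeln(Length(FetchPage('http://www.example.com/'))); // placeholder URL
end.
```

Dropping the Head call halves the number of requests per page, which may matter if the server limits request rates.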
 

Re:Indy 9 HTTP & Read Timeout

I forgot to include this call in my message:
HTTPResult := TStringStream.Create('');
It's in the program, though, so that's not the problem :)