patrickmn / osg
Universal sitemap generator
Home Page: https://patrickmn.com/projects/osg/
License: Other
Optimus Sitemap Generator (OSG) is a universal XML sitemap generator that works by crawling your website, avoiding excessive bandwidth overhead by only scanning the contents of pages that have changed since the last time the sitemap was generated.

== Installation

Download OSG from http://patrickmylund.com/projects/osg/

If you have Go installed, you can run:

go get github.com/pmylund/osg

(an osg binary will be added to your GOPATH/bin folder)

Note: You do not need to have Go installed to run the stand-alone version.

== Usage

OSG takes the name of a sitemap and a list of the pages from which crawling should begin. Any files linked to will be included in the sitemap, and pages' links will be followed automatically. OSG will not crawl outside the domain of a given link (to do this, list a page on each domain to include in the sitemap).

./osg sitemap.xml example.com
./osg sitemap.xml example.com/category1 example.com/category2
./osg -v sitemap.xml example.com

Run osg without any arguments for details, like how to ignore robots directives like nofollow, or how to exclude particular URL patterns.

See http://patrickmylund.com/projects/osg/ for more information.
Hello.

I've just tried to crawl http://reactivex.io/documentation because I found out some documentation pages appear to be hidden. Unfortunately, osg crashed with exit status 2, printing the following stack trace:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5787d0]
goroutine 856 [running]:
net/url.(*URL).String(0x0, 0x85, 0x0)
/usr/lib/go/src/net/url/url.go:742 +0x40
main.GetLinks(0x7f4045d63fd0, 0xc420a66bc0, 0xc420166800, 0x0, 0x40b78f, 0xc420461ca0)
/home/vrakfall/go/src/github.com/patrickmn/osg/main.go:290 +0x2f6
main.(*Crawler).GetLinks(0xc42014e2c0, 0x7f4045d63fd0, 0xc420a66bc0, 0xc420166800, 0xc420166080, 0xc420c7b860, 0x19, 0x0)
/home/vrakfall/go/src/github.com/patrickmn/osg/main.go:171 +0x5a
main.(*Crawler).Crawl(0xc42014e2c0, 0xc420166800, 0xc420166080, 0x0, 0x0)
/home/vrakfall/go/src/github.com/patrickmn/osg/main.go:147 +0x52d
created by main.(*Crawler).Crawl
/home/vrakfall/go/src/github.com/patrickmn/osg/main.go:154 +0x5c2
Osg found some dead links (normally returning a 404 status) just before crashing. This looks like some kind of memory error to me (I don't know if that can happen in Go, my knowledge of it being limited). It also worked fine on another website I tried, so the error seems linked to reactivex.io. Maybe there are too many sublinks? I'm sure the link graph isn't infinite, though; I tested that with another script.
While generating a sitemap, it also adds a URL like javascript:doSomething('http://exmple.com'), which is not a valid page URL.
goroutine 9534 [runnable]:
net._C2func_getaddrinfo(0x7fcf3c0010f0, 0x0)
/tmp/go-build758714176/net/_obj/_cgo_defun.c:42 +0x34
net.cgoLookupIPCNAME(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:179 +0x142
net.cgoLookupIP(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:225 +0x61
net.cgoLookupHost(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:103 +0x76
net.lookupHost(0xf843be11b0, 0xf80000000c, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/lookup_unix.go:56 +0x5a
net.LookupHost(0xf843be11b0, 0xf80000000c, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/doc.go:10 +0x5a
net.hostPortToIP(0x5e6144, 0x3, 0xf843be11b0, 0xf, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/ipsock.go:120 +0x223
net.ResolveTCPAddr(0x5e6144, 0x7fcf00000003, 0xf843be11b0, 0x7063740000000f, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/tcpsock.go:31 +0x51
net.resolveNetAddr(0x5f34bc, 0x6c61696400000004, 0x5e6144, 0x3, 0xf843be11b0, ...)
/home/patrick/apps/go/src/pkg/net/dial.go:50 +0x504
net.Dial(0x5e6144, 0x3, 0xf843be11b0, 0x20000000f, 0xf843c7a880, ...)
/home/patrick/apps/go/src/pkg/net/dial.go:92 +0x62
net/http.(*Transport).dial(0xf84005dac0, 0x5e6144, 0x70637400000003, 0xf843be11b0, 0xf, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:299 +0xd5
net/http.(*Transport).getConn(0xf84005dac0, 0xf842197f00, 0xf842197f00, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:311 +0xbe
net/http.(*Transport).RoundTrip(0xf84005dac0, 0xf8421f5a80, 0xf80000004e, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:155 +0x2ba
net/http.send(0xf8421f5a80, 0xf84005a7e0, 0xf84005dac0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:133 +0x3ca
net/http.(*Client).doFollowingRedirects(0x71be18, 0xf8421f5a80, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:227 +0x5e2
net/http.(*Client).Do(0x71be18, 0xf8421f5a80, 0x726573550000000a, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:100 +0x7e
main.Get(0xf843be5140, 0x7fcf0000004e, 0x0, 0x4e, 0xf843c7a800, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:321 +0x15a
main.(*Crawler).Crawl(0xf840068960, 0xf843bee930, 0xf840000310, 0x5e3f74, 0x0, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
created by main.(*Crawler).Crawl
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:144 +0xd6d
Can you please help me understand what's wrong?
Command: ./osg sitemap.xml http://mysitename.org
Hi,

Maybe you'd say osg kills my Apache because it's "overbooked": more workers are configured than can actually all work at the same time, so when they do, they allocate too much memory. You'd be right, but I think this overbooking is quite common and has its advantages.

So I put a 500 ms sleep on line 157, in the Crawl function, right before the recursion. I admit I don't exactly know what I'm doing, but maybe someone who understands all of the code, and the WaitGroup-based parallel synchronization going on there, can think of a smarter way to make osg put only a light load on a webserver.
And yes, I used -c1, which is the default.

Cheers,
fil
After downloading the binary and running it, I'm getting the following error.
Any idea why?
admin@dell-lat ~/Downloads/osg $ ./osg sitemap.xml patrickmn.com
panic: crypto: requested hash function is unavailable
goroutine 3 [running]:
crypto.Hash.New(0x3800000005, 0x40a6b5, 0x7fd2b2efa3ff, 0x10)
/home/patrick/apps/go/src/pkg/crypto/crypto.go:62 +0x95
crypto/x509.(*Certificate).CheckSignature(0xf8400c1580, 0x7fd200000004, 0xf84011400e, 0x10d600000438, 0xf84011445a, ...)
/home/patrick/apps/go/src/pkg/crypto/x509/x509.go:391 +0x68
crypto/x509.(*Certificate).CheckSignatureFrom(0xf8400c12c0, 0xf8400c1580, 0x0, 0x0, 0xf840121538, ...)
/home/patrick/apps/go/src/pkg/crypto/x509/x509.go:370 +0x15a
crypto/x509.(*CertPool).findVerifiedParents(0xf840122660, 0xf8400c12c0, 0x0, 0x0, 0x60, ...)
/home/patrick/apps/go/src/pkg/crypto/x509/cert_pool.go:44 +0x17d
crypto/x509.(*Certificate).buildChains(0xf8400c12c0, 0xf840202cc0, 0x7fd2b2efa698, 0x100000001, 0x7fd2b2efa6b0, ...)
/home/patrick/apps/go/src/pkg/crypto/x509/verify.go:198 +0x1c0
crypto/x509.(*Certificate).Verify(0xf8400c12c0, 0x0, 0x0, 0xf840122660, 0xf840122700, ...)
/home/patrick/apps/go/src/pkg/crypto/x509/verify.go:177 +0x1c1
crypto/tls.(*Conn).clientHandshake(0xf840109000, 0x0, 0x0, 0x7fd2b408a100)
/home/patrick/apps/go/src/pkg/crypto/tls/handshake_client.go:117 +0xfab
----- stack segment boundary -----
crypto/tls.(*Conn).Handshake(0xf840109000, 0x0, 0x0, 0xf840109000)
/home/patrick/apps/go/src/pkg/crypto/tls/conn.go:808 +0xdc
net/http.(*Transport).getConn(0xf84005dac0, 0xf84005c960, 0xf84005c960, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:369 +0x4aa
net/http.(*Transport).RoundTrip(0xf84005dac0, 0xf8400b70c0, 0x16, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:155 +0x2ba
net/http.send(0xf8400b70c0, 0xf84005a7e0, 0xf84005dac0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:133 +0x3ca
net/http.(*Client).doFollowingRedirects(0x71be18, 0xf8400b7000, 0xf8400c4000, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:227 +0x5e2
net/http.(*Client).Do(0x71be18, 0xf8400b7000, 0x726573550000000a, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:100 +0x7e
main.Get(0xf840069c40, 0x7fd200000015, 0x0, 0x15, 0xf840069c60, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:321 +0x15a
main.(*Crawler).Crawl(0xf840069980, 0xf840000230, 0xf840000310, 0x5e3f74, 0x0, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
main._func_004(0xf840077280, 0xf840077268, 0xf840077270, 0x0, 0x0, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:347 +0x72
created by main.generateSitemap
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:348 +0x31f
goroutine 1 [runnable]:
main.generateSitemap(0x7ffd06f6caed, 0xb, 0xf840077250, 0x100000001, 0x0, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:397 +0x5c7
main.main()
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:455 +0xcb0
goroutine 2 [syscall]:
created by runtime.main
/home/patrick/apps/go/src/pkg/runtime/proc.c:221
goroutine 4 [runnable]:
sync.runtime_Semacquire(0xf840077430, 0xf840077430)
/home/patrick/apps/go/src/pkg/runtime/zsema_amd64.c:146 +0x25
sync.(*WaitGroup).Wait(0xf840069940, 0x4136b8)
/home/patrick/apps/go/src/pkg/sync/waitgroup.go:78 +0xf2
main._func_005(0xf840077260, 0xf840077258, 0x0, 0x0)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:391 +0x28
created by main.generateSitemap
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:393 +0x58e
goroutine 5 [syscall]:
syscall.Syscall6()
/home/patrick/apps/go/src/pkg/syscall/asm_linux_amd64.s:40 +0x5
syscall.EpollWait(0xf800000007, 0xf8400bd010, 0xa0000000a, 0xffffffff, 0xc, ...)
/home/patrick/apps/go/src/pkg/syscall/zerrors_linux_amd64.go:1781 +0xa1
net.(*pollster).WaitFD(0xf8400bd000, 0xf84005b1c0, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/fd_linux.go:146 +0x110
net.(*pollServer).Run(0xf84005b1c0, 0x0)
/home/patrick/apps/go/src/pkg/net/fd.go:236 +0xe4
created by net.newPollServer
/home/patrick/apps/go/src/pkg/net/newpollserver.go:35 +0x382
goroutine 6 [chan receive]:
net.(*pollServer).WaitRead(0xf84005b1c0, 0xf8400b8120, 0xf84005c690, 0xb, 0x1, ...)
/home/patrick/apps/go/src/pkg/net/fd.go:268 +0x73
net.(*netFD).Read(0xf8400b8120, 0xf8400bf000, 0x100000001000, 0xffffffff, 0xf84005a450, ...)
/home/patrick/apps/go/src/pkg/net/fd.go:428 +0x1ec
net.(*TCPConn).Read(0xf8400773f8, 0xf8400bf000, 0x100000001000, 0xe7000000000, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/tcpsock_posix.go:87 +0xce
bufio.(*Reader).fill(0xf84005b300, 0xf840001be0)
/home/patrick/apps/go/src/pkg/bufio/bufio.go:77 +0xf0
bufio.(*Reader).Peek(0xf84005b300, 0xf800000001, 0xf8400c4001, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/bufio/bufio.go:102 +0xbc
net/http.(*persistConn).readLoop(0xf840084d80, 0x0)
/home/patrick/apps/go/src/pkg/net/http/transport.go:521 +0xab
created by net/http.(*Transport).getConn
/home/patrick/apps/go/src/pkg/net/http/transport.go:382 +0x5df
go get github.com/pmylund/osg
package github.com/golang/net/html: code in directory /Users/me/work/src/github.com/golang/net/html expects import "golang.org/x/net/html"
When a page on the root of the website has a relative link like index.php, osg tries to access this page:
http://subdomain.example.comindex.php
Also, when the link is ./index.php, it lists this URL in the sitemap: http://subdomain.example.com./index.php
When running the tool from the server the site I am trying to index resides on, I get the error:
Error opening [URL]: Get [URL]: x509: certificate signed by unknown authority
Where [URL] is the URL of the site.
When trying to run against that same URL from another server I have, I get:
Error opening [URL]: Get [URL]: remote error: internal error
Any ideas? Thank you!
Google Webmaster Tools gives the error "invalid tag: urlset" for the sitemap generated with the command below:
./sitemap -c 10 -gzip-level 1 -max-crawl 0 --no-lastmod sitemap.xml https://example.com
Hey, I crawled a site, and in the resulting XML file, all the lastmod times look like this:
0001-01-01T00:00:00+07:00
Any idea why this could happen?
My idea was that the Crawl function maybe fails to extract the date from the HTTP headers:
lm, err := time.Parse(http.TimeFormat, res.Header.Get("Last-Modified"))
if err != nil {
log.Println("Couldn't parse Last-Modified time in header", err)
}
So I inserted this error check, but it didn't fire.
Any pointers on how to debug/fix this are greatly appreciated.