osg's Introduction

Optimus Sitemap Generator (OSG) is a universal XML sitemap generator that works
by crawling your website, avoiding excessive bandwidth overhead by only scanning
the contents of pages that have changed since the last time the sitemap was
generated.

== Installation

Download OSG from http://patrickmylund.com/projects/osg/

If you have Go installed, you can run: go get github.com/pmylund/osg
(an osg binary will be added to your GOPATH/bin folder)

Note: You do not need to have Go installed to run the stand-alone version.

== Usage

OSG takes the name of a sitemap, and a list of the pages from which crawling
should begin. Any files linked to will be included in the sitemap, and pages'
links will be followed automatically. OSG will not crawl outside the domain
of a given link (to do this, list a page on each domain to include in the
sitemap.)

  ./osg sitemap.xml example.com
  ./osg sitemap.xml example.com/category1 example.com/category2
  ./osg -v sitemap.xml example.com

Run osg without any arguments for details, such as how to ignore robots
directives like nofollow or how to exclude particular URL patterns.

See http://patrickmylund.com/projects/osg/ for more information.

osg's People

Contributors

etna-mhoskison, nscience, patrickmn, sametsisartenep

osg's Issues

Runtime error while crawling `reactivex.io`.

Hello.

I've just tried to crawl http://reactivex.io/documentation because I found out that some documentation pages appear to be hidden. Unfortunately, osg crashed with exit status 2 and printed the following stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5787d0]

goroutine 856 [running]:
net/url.(*URL).String(0x0, 0x85, 0x0)
        /usr/lib/go/src/net/url/url.go:742 +0x40
main.GetLinks(0x7f4045d63fd0, 0xc420a66bc0, 0xc420166800, 0x0, 0x40b78f, 0xc420461ca0)
        /home/vrakfall/go/src/github.com/patrickmn/osg/main.go:290 +0x2f6
main.(*Crawler).GetLinks(0xc42014e2c0, 0x7f4045d63fd0, 0xc420a66bc0, 0xc420166800, 0xc420166080, 0xc420c7b860, 0x19, 0x0)
        /home/vrakfall/go/src/github.com/patrickmn/osg/main.go:171 +0x5a
main.(*Crawler).Crawl(0xc42014e2c0, 0xc420166800, 0xc420166080, 0x0, 0x0)
        /home/vrakfall/go/src/github.com/patrickmn/osg/main.go:147 +0x52d
created by main.(*Crawler).Crawl
        /home/vrakfall/go/src/github.com/patrickmn/osg/main.go:154 +0x5c2

Just before crashing, osg found some dead links, which returned a 404 status as expected. This looks like some kind of memory overflow to me (I don't know whether that can happen in Go; my knowledge of it is limited).

osg also worked fine on another website I tried, so the error seems linked to reactivex.io. Maybe there are too many sublinks? I'm sure the link graph is not infinite though; I tested it with another script.
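
The trace above shows `(*URL).String` being called on a nil receiver (its first argument is `0x0`) from `main.GetLinks`. One way this can happen in Go is when the error from `url.Parse` is discarded, because `url.Parse` returns a nil `*url.URL` on failure. Whether that is the actual cause here is an assumption. The sketch below only illustrates the guard; `linkString` is a hypothetical helper, not a function from osg.

```go
package main

import (
	"fmt"
	"net/url"
)

// linkString is a hypothetical helper: it parses href and returns its
// string form, reporting an error instead of dereferencing a nil *url.URL.
func linkString(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		// url.Parse returns a nil *url.URL together with the error,
		// so calling u.String() here would panic.
		return "", err
	}
	return u.String(), nil
}

func main() {
	for _, href := range []string{"http://reactivex.io/documentation", ":broken"} {
		s, err := linkString(href)
		if err != nil {
			fmt.Println("skipping link:", err)
			continue
		}
		fmt.Println(s)
	}
}
```

With a check like this, a malformed link is skipped and logged rather than ending the whole crawl.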

Strange errors in output

goroutine 9534 [runnable]:
net._C2func_getaddrinfo(0x7fcf3c0010f0, 0x0)
/tmp/go-build758714176/net/_obj/_cgo_defun.c:42 +0x34
net.cgoLookupIPCNAME(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:179 +0x142
net.cgoLookupIP(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:225 +0x61
net.cgoLookupHost(0xf843be11b0, 0xc, 0x0, 0x0, 0x0, ...)
/tmp/go-build758714176/net/_obj/_cgo_gotypes.go:103 +0x76
net.lookupHost(0xf843be11b0, 0xf80000000c, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/lookup_unix.go:56 +0x5a
net.LookupHost(0xf843be11b0, 0xf80000000c, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/doc.go:10 +0x5a
net.hostPortToIP(0x5e6144, 0x3, 0xf843be11b0, 0xf, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/ipsock.go:120 +0x223
net.ResolveTCPAddr(0x5e6144, 0x7fcf00000003, 0xf843be11b0, 0x7063740000000f, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/tcpsock.go:31 +0x51
net.resolveNetAddr(0x5f34bc, 0x6c61696400000004, 0x5e6144, 0x3, 0xf843be11b0, ...)
/home/patrick/apps/go/src/pkg/net/dial.go:50 +0x504
net.Dial(0x5e6144, 0x3, 0xf843be11b0, 0x20000000f, 0xf843c7a880, ...)
/home/patrick/apps/go/src/pkg/net/dial.go:92 +0x62
net/http.(*Transport).dial(0xf84005dac0, 0x5e6144, 0x70637400000003, 0xf843be11b0, 0xf, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:299 +0xd5
net/http.(*Transport).getConn(0xf84005dac0, 0xf842197f00, 0xf842197f00, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:311 +0xbe
net/http.(*Transport).RoundTrip(0xf84005dac0, 0xf8421f5a80, 0xf80000004e, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/transport.go:155 +0x2ba
net/http.send(0xf8421f5a80, 0xf84005a7e0, 0xf84005dac0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:133 +0x3ca
net/http.(*Client).doFollowingRedirects(0x71be18, 0xf8421f5a80, 0x0, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:227 +0x5e2
net/http.(*Client).Do(0x71be18, 0xf8421f5a80, 0x726573550000000a, 0x0, 0x0, ...)
/home/patrick/apps/go/src/pkg/net/http/client.go:100 +0x7e
main.Get(0xf843be5140, 0x7fcf0000004e, 0x0, 0x4e, 0xf843c7a800, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:321 +0x15a
main.(*Crawler).Crawl(0xf840068960, 0xf843bee930, 0xf840000310, 0x5e3f74, 0x0, ...)
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
created by main.(*Crawler).Crawl
/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:144 +0xd6d

Can you please help me understand what's wrong?
Command: ./osg sitemap.xml http://mysitename.org

does it make sense to put some sleeps somewhere so it doesn't kill my webserver

hi,
maybe you'd say it kills my Apache because it's "overbooked": I have configured more workers than the server has memory for if they all actually have to work at the same time. you'd be right, but I think this overbooking is quite common and has its advantages.

so, I put a sleep of 500 ms on line 157, in the crawl function, right before the recursion. I admit that I don't really know what I'm doing, but maybe someone who understands all of the code, and the strange waitgroup parallel syncy stuff that seems to be going on there, can think of a smart way for osg to put only a light load on a webserver.

and yes, i used -c1, which is default.

cheers,
fil

hash function is not available

After downloading the binary and running it, I'm getting the following error.

Any idea why?

admin@dell-lat ~/Downloads/osg $ ./osg sitemap.xml patrickmn.com
panic: crypto: requested hash function is unavailable

goroutine 3 [running]:
crypto.Hash.New(0x3800000005, 0x40a6b5, 0x7fd2b2efa3ff, 0x10)
	/home/patrick/apps/go/src/pkg/crypto/crypto.go:62 +0x95
crypto/x509.(*Certificate).CheckSignature(0xf8400c1580, 0x7fd200000004, 0xf84011400e, 0x10d600000438, 0xf84011445a, ...)
	/home/patrick/apps/go/src/pkg/crypto/x509/x509.go:391 +0x68
crypto/x509.(*Certificate).CheckSignatureFrom(0xf8400c12c0, 0xf8400c1580, 0x0, 0x0, 0xf840121538, ...)
	/home/patrick/apps/go/src/pkg/crypto/x509/x509.go:370 +0x15a
crypto/x509.(*CertPool).findVerifiedParents(0xf840122660, 0xf8400c12c0, 0x0, 0x0, 0x60, ...)
	/home/patrick/apps/go/src/pkg/crypto/x509/cert_pool.go:44 +0x17d
crypto/x509.(*Certificate).buildChains(0xf8400c12c0, 0xf840202cc0, 0x7fd2b2efa698, 0x100000001, 0x7fd2b2efa6b0, ...)
	/home/patrick/apps/go/src/pkg/crypto/x509/verify.go:198 +0x1c0
crypto/x509.(*Certificate).Verify(0xf8400c12c0, 0x0, 0x0, 0xf840122660, 0xf840122700, ...)
	/home/patrick/apps/go/src/pkg/crypto/x509/verify.go:177 +0x1c1
crypto/tls.(*Conn).clientHandshake(0xf840109000, 0x0, 0x0, 0x7fd2b408a100)
	/home/patrick/apps/go/src/pkg/crypto/tls/handshake_client.go:117 +0xfab
----- stack segment boundary -----
crypto/tls.(*Conn).Handshake(0xf840109000, 0x0, 0x0, 0xf840109000)
	/home/patrick/apps/go/src/pkg/crypto/tls/conn.go:808 +0xdc
net/http.(*Transport).getConn(0xf84005dac0, 0xf84005c960, 0xf84005c960, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/http/transport.go:369 +0x4aa
net/http.(*Transport).RoundTrip(0xf84005dac0, 0xf8400b70c0, 0x16, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/http/transport.go:155 +0x2ba
net/http.send(0xf8400b70c0, 0xf84005a7e0, 0xf84005dac0, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/http/client.go:133 +0x3ca
net/http.(*Client).doFollowingRedirects(0x71be18, 0xf8400b7000, 0xf8400c4000, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/http/client.go:227 +0x5e2
net/http.(*Client).Do(0x71be18, 0xf8400b7000, 0x726573550000000a, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/http/client.go:100 +0x7e
main.Get(0xf840069c40, 0x7fd200000015, 0x0, 0x15, 0xf840069c60, ...)
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:321 +0x15a
main.(*Crawler).Crawl(0xf840069980, 0xf840000230, 0xf840000310, 0x5e3f74, 0x0, ...)
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:111 +0x5b1
main._func_004(0xf840077280, 0xf840077268, 0xf840077270, 0x0, 0x0, ...)
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:347 +0x72
created by main.generateSitemap
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:348 +0x31f

goroutine 1 [runnable]:
main.generateSitemap(0x7ffd06f6caed, 0xb, 0xf840077250, 0x100000001, 0x0, ...)
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:397 +0x5c7
main.main()
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:455 +0xcb0

goroutine 2 [syscall]:
created by runtime.main
	/home/patrick/apps/go/src/pkg/runtime/proc.c:221

goroutine 4 [runnable]:
sync.runtime_Semacquire(0xf840077430, 0xf840077430)
	/home/patrick/apps/go/src/pkg/runtime/zsema_amd64.c:146 +0x25
sync.(*WaitGroup).Wait(0xf840069940, 0x4136b8)
	/home/patrick/apps/go/src/pkg/sync/waitgroup.go:78 +0xf2
main._func_005(0xf840077260, 0xf840077258, 0x0, 0x0)
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:391 +0x28
created by main.generateSitemap
	/home/patrick/Dropbox/Projects/go/src/github.com/pmylund/osg/main.go:393 +0x58e

goroutine 5 [syscall]:
syscall.Syscall6()
	/home/patrick/apps/go/src/pkg/syscall/asm_linux_amd64.s:40 +0x5
syscall.EpollWait(0xf800000007, 0xf8400bd010, 0xa0000000a, 0xffffffff, 0xc, ...)
	/home/patrick/apps/go/src/pkg/syscall/zerrors_linux_amd64.go:1781 +0xa1
net.(*pollster).WaitFD(0xf8400bd000, 0xf84005b1c0, 0x0, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/fd_linux.go:146 +0x110
net.(*pollServer).Run(0xf84005b1c0, 0x0)
	/home/patrick/apps/go/src/pkg/net/fd.go:236 +0xe4
created by net.newPollServer
	/home/patrick/apps/go/src/pkg/net/newpollserver.go:35 +0x382

goroutine 6 [chan receive]:
net.(*pollServer).WaitRead(0xf84005b1c0, 0xf8400b8120, 0xf84005c690, 0xb, 0x1, ...)
	/home/patrick/apps/go/src/pkg/net/fd.go:268 +0x73
net.(*netFD).Read(0xf8400b8120, 0xf8400bf000, 0x100000001000, 0xffffffff, 0xf84005a450, ...)
	/home/patrick/apps/go/src/pkg/net/fd.go:428 +0x1ec
net.(*TCPConn).Read(0xf8400773f8, 0xf8400bf000, 0x100000001000, 0xe7000000000, 0x0, ...)
	/home/patrick/apps/go/src/pkg/net/tcpsock_posix.go:87 +0xce
bufio.(*Reader).fill(0xf84005b300, 0xf840001be0)
	/home/patrick/apps/go/src/pkg/bufio/bufio.go:77 +0xf0
bufio.(*Reader).Peek(0xf84005b300, 0xf800000001, 0xf8400c4001, 0x0, 0x0, ...)
	/home/patrick/apps/go/src/pkg/bufio/bufio.go:102 +0xbc
net/http.(*persistConn).readLoop(0xf840084d80, 0x0)
	/home/patrick/apps/go/src/pkg/net/http/transport.go:521 +0xab
created by net/http.(*Transport).getConn
	/home/patrick/apps/go/src/pkg/net/http/transport.go:382 +0x5df

The instructions don't yield a binary

go get github.com/pmylund/osg
package github.com/golang/net/html: code in directory /Users/me/work/src/github.com/golang/net/html expects import "golang.org/x/net/html"

Relative paths not working

When a page at the root of the website contains a relative link such as index.php, osg tries to access this page:
http://subdomain.example.comindex.php

Also, when the link is ./index.php, osg lists this URL in the sitemap: http://subdomain.example.com./index.php
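
Both URLs above look like the result of appending the link text to the host name. Go's standard library can resolve a relative reference against the URL of the page it was found on with `(*url.URL).ResolveReference`. The following is a minimal sketch of that approach; `resolveLink` is a hypothetical helper and does not describe how osg currently builds its URLs.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper that resolves href against the URL
// of the page it was found on.
func resolveLink(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	for _, href := range []string{"index.php", "./index.php", "/about"} {
		abs, err := resolveLink("http://subdomain.example.com/", href)
		if err != nil {
			fmt.Println("skipping link:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Resolved this way, both `index.php` and `./index.php` become `http://subdomain.example.com/index.php`.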

Error opening [URL]: Get [URL]: x509: certificate signed by unknown authority

When running the tool from the server the site I am trying to index resides on, I get the error:

Error opening [URL]: Get [URL]: x509: certificate signed by unknown authority

Where [URL] is the URL of the site.

When I run it from another server I have, against that same URL, I get:

Error opening [URL]: Get [URL]: remote error: internal error

Any ideas? Thank you!
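
The first error is what Go's TLS client reports when it cannot verify the server's certificate chain against the root certificates it trusts. In a Go program, one way to handle a certificate signed by a private or missing CA is to give the HTTP client an explicit certificate pool. The sketch below shows that technique in isolation; `clientWithCA` is a hypothetical helper, and the report does not say whether osg offers any option for this.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// clientWithCA is a hypothetical helper that returns an HTTP client which
// trusts only the CA certificates contained in pemBytes.
func clientWithCA(pemBytes []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no certificates found in PEM data")
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	// Placeholder input: real code would read the CA certificate from a file.
	_, err := clientWithCA([]byte("not a certificate"))
	fmt.Println("error:", err)
}
```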

Invalid Tag `urlset`

Google Webmaster reports the error "invalid tag:urlset" for the sitemap generated with the command below:

./sitemap -c 10 -gzip-level 1 -max-crawl 0 --no-lastmod sitemap.xml https://example.com

timestamps not working

hey, I crawled a site, and in the resulting XML file all the lastmod times look like this:
0001-01-01T00:00:00+07:00

any idea why this could happen?

my guess was that the crawl function fails to extract the data from the HTTP headers:

  lm, err := time.Parse(http.TimeFormat, res.Header.Get("Last-Modified"))
  if err != nil {
      log.Println("Couldn't parse Last-Modified time in header", err)
  }

so I inserted this error check, but it didn't fire.

any pointers on how to debug or fix this would be greatly appreciated.
