sitemapgen4j's Introduction

sitemapgen4j

SitemapGen4j is a library to generate XML sitemaps in Java.

What's an XML sitemap?

Quoting from sitemaps.org:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
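
To make the quoted description concrete, here is a minimal sketch that builds the smallest useful sitemap by hand: one url entry carrying loc, lastmod, changefreq and priority. It uses only the Java standard library, not SitemapGen4j. The class and method names (MinimalSitemapSketch, render) are made up for illustration, and the inputs are assumed to be XML-safe already.

```java
final class MinimalSitemapSketch {
    // Namespace used by Sitemap 0.9 documents.
    static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Renders a one-entry sitemap; assumes the arguments need no XML escaping. */
    static String render(String loc, String lastMod, String changeFreq, double priority) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"").append(NS).append("\">\n");
        sb.append("  <url>\n");
        sb.append("    <loc>").append(loc).append("</loc>\n");
        sb.append("    <lastmod>").append(lastMod).append("</lastmod>\n");
        sb.append("    <changefreq>").append(changeFreq).append("</changefreq>\n");
        sb.append("    <priority>").append(priority).append("</priority>\n");
        sb.append("  </url>\n");
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("http://www.example.com/index.html", "2009-02-07", "hourly", 1.0));
    }
}
```

In practice the generator classes described below produce this structure for you; the sketch only shows what ends up in the file.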

Getting started

The easiest way to get started is to just use the WebSitemapGenerator class, like this:

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
wsg.addUrl("http://www.example.com/index.html"); // repeat multiple times
wsg.write();

Configuring options

But there are a lot of nifty options available for URLs and for the generator as a whole. To configure the generator, use a builder:

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .gzip(true).build(); // enable gzipped output
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

To configure the URLs, construct a WebSitemapUrl with WebSitemapUrl.Options.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
    .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
// this will configure the URL with lastmod=now, priority=1.0, changefreq=hourly 
wsg.addUrl(url);
wsg.write();

Configuring the date format

One important configuration option for the sitemap generator is the date format. The W3C datetime standard allows you to choose the precision of your datetime (anything from just specifying the year like "1997" to specifying the fraction of the second like "1997-07-16T19:20:30.45+01:00"); if you don't specify one, we'll try to guess which one you want, and we'll use the default timezone of the local machine, which might not be what you prefer.

// Use DAY pattern (2009-02-07), Greenwich Mean Time timezone
W3CDateFormat dateFormat = new W3CDateFormat(Pattern.DAY); 
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT"));
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .dateFormat(dateFormat).build(); // actually use the configured dateFormat
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
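
For reference, the precision levels mentioned above can be sketched with java.time from the standard library. This is not the library's W3CDateFormat, only an illustration of what each level of precision looks like; the class name W3cPrecisionSketch and the format helper are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class W3cPrecisionSketch {
    /** Formats a moment with a java.time pattern; "XXX" prints offsets such as +01:00. */
    static String format(OffsetDateTime when, String pattern) {
        return when.format(DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    public static void main(String[] args) {
        // The example moment from the W3C note: 1997-07-16T19:20:30.45+01:00
        OffsetDateTime when =
            OffsetDateTime.of(1997, 7, 16, 19, 20, 30, 450_000_000, ZoneOffset.ofHours(1));
        String[] patterns = {
            "uuuu",                          // year only
            "uuuu-MM",                       // year and month
            "uuuu-MM-dd",                    // complete date
            "uuuu-MM-dd'T'HH:mmXXX",         // date plus hours and minutes
            "uuuu-MM-dd'T'HH:mm:ssXXX",      // ... plus seconds
            "uuuu-MM-dd'T'HH:mm:ss.SSXXX"    // ... plus a fraction of a second
        };
        for (String pattern : patterns) {
            System.out.println(format(when, pattern));
        }
    }
}
```

Choosing the precision and the time zone explicitly, as in the builder example above, avoids depending on a guess or on the machine's default time zone.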

Lots of URLs: a sitemap index file

One sitemap can contain a maximum of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain only 1,000 URLs.) If you need to put more URLs than that in a sitemap, you'll have to use a sitemap index file. Fortunately, WebSitemapGenerator can manage the whole thing for you.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
for (int i = 0; i < 60000; i++) wsg.addUrl("http://www.example.com/doc"+i+".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml

That will generate two sitemaps for 60K URLs: sitemap1.xml (with 50K urls) and sitemap2.xml (with the remaining 10K), and then generate a sitemap_index.xml file describing the two.
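
The arithmetic behind that split can be sketched in a few lines. This is only an illustration of the counting, not the library's implementation; SitemapSplitSketch and chunkSizes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SitemapSplitSketch {
    /** Returns how many URLs land in each numbered sitemap when at most maxUrls fit in one file. */
    static List<Integer> chunkSizes(int totalUrls, int maxUrls) {
        if (maxUrls <= 0) {
            throw new IllegalArgumentException("maxUrls must be positive");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = totalUrls; remaining > 0; remaining -= maxUrls) {
            sizes.add(Math.min(remaining, maxUrls)); // a full file, or whatever is left
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(chunkSizes(60000, 50000)); // prints [50000, 10000]
        System.out.println(chunkSizes(2500, 1000));   // prints [1000, 1000, 500]
    }
}
```

The second call uses the 1,000-URL limit mentioned above for Google News sitemaps.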

It's also possible to carefully organize your sub-sitemaps. For example, it's recommended to group URLs with the same changeFreq together (have one sitemap for changeFreq "daily" and another for changeFreq "yearly"), so you can modify the lastMod of the daily sitemap without modifying the lastMod of the yearly sitemap. To do that, just construct your sitemaps one at a time using the WebSitemapGenerator, then use the SitemapIndexGenerator to create a single index for all of them.

WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("foo").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/foo"+i+".html");
wsg.write();
// generate bar sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("bar").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/bar"+i+".html");
wsg.write();
// generate sitemap index for foo + bar 
SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.example.com", myFile);
sig.addUrl("http://www.example.com/foo.xml");
sig.addUrl("http://www.example.com/bar.xml");
sig.write();
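
If your URLs are not already separated by change frequency, a grouping step can come first. The sketch below uses only the standard library; the Page class and the group method are hypothetical and stand in for whatever your application uses to describe a page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupByChangeFreqSketch {
    /** Hypothetical page description: a URL plus how often it changes ("daily", "yearly", ...). */
    static final class Page {
        final String url;
        final String changeFreq;

        Page(String url, String changeFreq) {
            this.url = url;
            this.changeFreq = changeFreq;
        }
    }

    /** Buckets URLs by change frequency, keeping the buckets in first-seen order. */
    static Map<String, List<String>> group(List<Page> pages) {
        Map<String, List<String>> byFreq = new LinkedHashMap<>();
        for (Page page : pages) {
            byFreq.computeIfAbsent(page.changeFreq, key -> new ArrayList<>()).add(page.url);
        }
        return byFreq;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://www.example.com/news.html", "daily"),
            new Page("http://www.example.com/about.html", "yearly"),
            new Page("http://www.example.com/blog.html", "daily"));
        // Each bucket would then go to its own generator, e.g. one per fileNamePrefix.
        group(pages).forEach((freq, urls) -> System.out.println(freq + " -> " + urls));
    }
}
```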

You could also use the SitemapIndexGenerator to incorporate sitemaps generated by other tools. For example, you might use Google's official Python sitemap generator to generate some sitemaps, and use WebSitemapGenerator to generate some sitemaps, and use SitemapIndexGenerator to make an index of all of them.

Validate your sitemaps

SitemapGen4j can also validate your sitemaps using the official XML Schema Definition (XSD). If you used SitemapGen4j to make the sitemaps, you shouldn't need to do this unless there's a bug in our code. But you can use it to validate sitemaps generated by other tools, and it provides an extra level of safety.

It's easy to configure the WebSitemapGenerator to automatically validate your sitemaps right after you write them (but this does slow things down, naturally).

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .autoValidate(true).build(); // validate the sitemap after writing
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

You can also use the SitemapValidator directly to validate sitemaps. It has two methods: validateWebSitemap(File f) and validateSitemapIndex(File f).

Google-specific sitemaps

Google can understand a wide variety of custom sitemap formats that they made up, including Mobile sitemaps, Geo sitemaps, Code sitemaps (for Google Code search), Google News sitemaps, and Video sitemaps. SitemapGen4j can generate any/all of these different types of sitemaps.

To generate a special type of sitemap, just use GoogleMobileSitemapGenerator, GoogleGeoSitemapGenerator, GoogleCodeSitemapGenerator, GoogleNewsSitemapGenerator, or GoogleVideoSitemapGenerator instead of WebSitemapGenerator.

You can't mix-and-match regular URLs with Google-specific sitemaps, so you'll also have to use a GoogleMobileSitemapUrl, GoogleGeoSitemapUrl, GoogleCodeSitemapUrl, GoogleNewsSitemapUrl, or GoogleVideoSitemapUrl instead of a WebSitemapUrl. Each of them has unique configurable options not available to regular web URLs.

sitemapgen4j's People

Contributors

andrewsmedina, dfabulich, eximius313, gregorko, jbeard6, jiwhiz, marcfrederick, mkurz, ramsrib, spekr, thorkarlsson, twb-foreach

sitemapgen4j's Issues

Unable to create a suffix pattern for sitemaps.

Hi,

While using your library, I noticed there is no way to specify a suffix (custom numbering, per se); we have to rely on the rule that the code automatically appends a number (sitemap1.xml) when the number of URLs exceeds a certain maximum. If sitemapGeneratorOptions offered, just like the existing option to specify a prefix, a way to add a custom string pattern suffix, with suffixes turned on or off by a flag, it would give better control over how the numbering works for sitemaps.

I cloned your repo, made the change locally, and then tried to push my branch to your repo so I could submit a PR, but I ran into a permissions issue. Please let me know the right way to approach this, or grant me permission to submit a PR, so you can review the change I am proposing.

A quick response is highly appreciated!
Thanks a bunch!
Ankita.

Pagination Support

Thank you for your useful library.
I'm wondering, does this library support pagination?

In-memory support

Thanks for sharing this lib, saved me some time already!
It would be great if the lib would be able to create a (file-) result without accessing the hard disk. Any plans for this feature?

Error using maxUrls( .. )

If I write the following:

static final int MAX_NEWS_URLS_NUM = 2000;
SitemapGeneratorBuilder<GoogleNewsSitemapGenerator> sgb= GoogleNewsSitemapGenerator.builder(MyURL, MyDirectory).dateFormat(dateFormat);
sgb.maxUrls(MAX_NEWS_URLS_NUM).build();  // here is the Problem

I cannot use the function maxUrls(..) because it produces an error!

Thread safety

For large sites it is desirable to parallelize sitemap generation. Please make SitemapGenerator thread-safe.

Publish license

Hi, can you please publish a license? I'd like to use your library, but whether I can largely depends on the license it's published under.

Autovalidation for GoogleImageSitemapGenerator throws RuntimeException

Using auto validation in combination with a GoogleImageSitemapGenerator fails.

Code extract:

GoogleImageSitemapGenerator generator = GoogleImageSitemapGenerator
    .builder("https://www.google.com", new File(System.getProperty("java.io.tmpdir")))
    .gzip(false)
    .autoValidate(true)
    .allowEmptySitemap(false)
    .allowMultipleSitemaps(true)
    .build();

Image image = new Image.ImageBuilder("https://www.google.com/bug.jpg").build();

generator.addUrl(new GoogleImageSitemapUrl.Options("https://www.google.com/any")
    .images(image)
    .changeFreq(ChangeFreq.DAILY)
    .priority(Priority.DEFAULT.getValue())
    .lastMod(new Date())
    .build());

generator.write();

Exception:

Exception in thread "main" java.lang.RuntimeException: Sitemap file failed to validate (bug?)
	at com.redfin.sitemapgenerator.SitemapGenerator.writeSiteMap(SitemapGenerator.java:280)
	at com.redfin.sitemapgenerator.SitemapGenerator.write(SitemapGenerator.java:173)
	at com.redfin.sitemapgenerator.GoogleImageSitemapGenerator.write(GoogleImageSitemapGenerator.java:11)
	at be.netmediaeurope.promoplatform.promobutler.controllers.sitemap.service.v2.delegate.ProducerDetailSitemapGeneratorDelegate.main(ProducerDetailSitemapGeneratorDelegate.java:108)
Caused by: org.xml.sax.SAXParseException; lineNumber: 8; columnNumber: 18; cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'image:image'.

Missing .xsd
sitemap-image.xsd

Add image to GoogleLinkSitemapUrl

Currently we can add alternates by using GoogleLinkSitemapUrl and images by using GoogleImageSitemapUrl. In my case I want to have both at the same time to achieve the following output:

<url>
  <loc>http://www.example.com/en/product-1</loc>
  <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/de/product-1" />
  <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en/product-1" />
  <image:image>
   <image:loc>http://www.example.com/image1.jpg</image:loc>
  </image:image>
  <image:image>
   <image:loc>http://www.example.com/image2.jpg</image:loc>
  </image:image>
</url>

What do you think? My idea is to add images to GoogleLinkSitemapUrl and maybe deprecate GoogleImageSitemapUrl. I will implement it myself, but I want to confirm the solution with the community first.

Date validation fails

Hello,

the following code snippet causes a date validation failure. I guess it's because the seconds are 0 and missing in the rendered date string.

        WebSitemapGenerator generator = WebSitemapGenerator.builder(
                "https://github.com", new File( filePath ))
                .autoValidate(true).build();
        generator.addUrl(
                new WebSitemapUrl.Options("https://github.com/dfabulich/sitemapgen4j")
                        .lastMod(new Date(1408464120000L))
                        .build() );
        generator.write();

The error:

Caused by: org.xml.sax.SAXParseException; lineNumber: 5; columnNumber: 46; cvc-datatype-valid.1.2.3: "2014-08-19T18:02+02:00" is not a valid value of union type "tLastmod".
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:134)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:437)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:325)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(XMLSchemaValidator.java:458)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(XMLSchemaValidator.java:3237)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.elementLocallyValidType(XMLSchemaValidator.java:3152)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.processElementContent(XMLSchemaValidator.java:3062)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleEndElement(XMLSchemaValidator.java:2140)
    at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.endElement(XMLSchemaValidator.java:859)
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorHandlerImpl.endElement(ValidatorHandlerImpl.java:584)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorHandlerImpl.validate(ValidatorHandlerImpl.java:730)
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:102)
    at javax.xml.validation.Validator.validate(Validator.java:124)
    at com.redfin.sitemapgenerator.SitemapValidator.validateXml(SitemapValidator.java:75)
    at com.redfin.sitemapgenerator.SitemapValidator.validateWebSitemap(SitemapValidator.java:61)
    at com.redfin.sitemapgenerator.SitemapGenerator.writeSiteMap(SitemapGenerator.java:238)
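
The reporter's guess fits how XML Schema treats timestamps: the xsd:dateTime lexical form requires hh:mm:ss, so a value that stops at minutes is rejected. The sketch below reproduces the effect with java.time rather than the library's own date formatting, so it illustrates the failure mode without claiming to show the library's code path; LastModSecondsSketch and withSeconds are hypothetical names.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LastModSecondsSketch {
    // Always prints the seconds field, even when it is zero.
    static final DateTimeFormatter WITH_SECONDS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ssXXX", Locale.ROOT);

    /** Formats an epoch-millisecond timestamp at the given offset, seconds included. */
    static String withSeconds(long epochMillis, ZoneOffset offset) {
        return Instant.ofEpochMilli(epochMillis).atOffset(offset).format(WITH_SECONDS);
    }

    public static void main(String[] args) {
        long millis = 1408464120000L;           // the timestamp from the report
        ZoneOffset offset = ZoneOffset.ofHours(2);

        OffsetDateTime moment = Instant.ofEpochMilli(millis).atOffset(offset);
        // java.time's toString() also drops zero seconds, giving the rejected shape:
        System.out.println(moment);                       // 2014-08-19T18:02+02:00
        // An explicit pattern keeps them:
        System.out.println(withSeconds(millis, offset));  // 2014-08-19T18:02:00+02:00
    }
}
```

Until this is fixed in the library, configuring an explicit date format on the builder (as shown in "Configuring the date format") is a plausible workaround, since it avoids the automatic choice of precision.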

Error with gzip(true) and autoValidate(true)

CODE:

WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", new File(OUT_DIR))
    .fileNamePrefix("sitemap_big")
    .gzip(true)
    .autoValidate(true)
    .build();
for (int i = 0; i < 49999; i++)
    wsg.addUrl("http://www.example.com/foo" + i + ".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml

Exception:

Exception in thread "main" java.lang.RuntimeException: Sitemap file failed to validate (bug?)
    at com.redfin.sitemapgenerator.SitemapGenerator.writeSiteMap(SitemapGenerator.java:248)
    at com.redfin.sitemapgenerator.SitemapGenerator.write(SitemapGenerator.java:169)
    at com.redfin.sitemapgenerator.WebSitemapGenerator.write(WebSitemapGenerator.java:12)
    at com.nttdata.sitemap.test.SitemapBuilderTest.main(SitemapBuilderTest.java:19)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:998)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
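
A failure at line 1, column 1 is what one would expect if the validator reads the compressed bytes as if they were XML. The sketch below shows that effect with the standard library only. It is an assumption about the cause, not a confirmed diagnosis of the library, and GzipParseSketch with its methods is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class GzipParseSketch {
    /** Compresses text the way a gzipped sitemap file would be stored. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns true if the stream is well-formed XML. */
    private static boolean parses(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors, print nothing
            builder.parse(in);
            return true;
        } catch (SAXException | IOException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Feeds the compressed bytes straight to the XML parser. */
    static boolean parsesRaw(byte[] data) {
        return parses(new ByteArrayInputStream(data));
    }

    /** Decompresses first, then parses. */
    static boolean parsesGunzipped(byte[] data) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return parses(in);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"/>");
        System.out.println("raw bytes parse:    " + parsesRaw(gz));
        System.out.println("gunzipped parse:    " + parsesGunzipped(gz));
    }
}
```

If this is indeed the cause, wrapping the file in a GZIPInputStream before validation, or validating before compressing, would avoid the error.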

Sitemap Limit Issue

Although a sitemap supports a maximum of 50K URLs (MAX_SITEMAPS_PER_INDEX), the files are not accepted when submitted in Google Search Console; they fail with the error "Couldn't fetch".

For example, I had a sitemap file with 12K links and could not submit it, whatever I tried. It only started working once I kept just the first 500 links.

Further, I strongly recommend adding an option to the API to set a custom URL limit for each child sitemap file, or making MAX_SITEMAPS_PER_INDEX non-final.

Allow Empty Files

We have a dynamic site in which sitemaps may periodically be empty. Crawlers, however, are expecting to find the sitemap files at a known location and so, at these times, we need the ability to publish an empty sitemap and/or index.

Please add an option to allow empty sitemaps and index files to be written. It need not be the default.

Escape XML entities

Having & in a URL causes sitemapgen4j to produce invalid XML:

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", dir);
wsg.addUrl("http://www.example.com/Tips&Tricks.html");
wsg.write();

Outcome:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
  <url>
    <loc>http://www.example.com/Tips&Tricks.html</loc>
  </url>
</urlset>

instead of:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
  <url>
    <loc>http://www.example.com/Tips&amp;Tricks.html</loc>
  </url>
</urlset>
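
A minimal sketch of the escaping step, assuming it is applied to each URL just before it is written into the loc element. XmlEscapeSketch and escapeXml are hypothetical names, not part of sitemapgen4j.

```java
final class XmlEscapeSketch {
    /** Replaces the five characters that XML reserves with their predefined entities. */
    static String escapeXml(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/Tips&amp;Tricks.html
        System.out.println(escapeXml("http://www.example.com/Tips&Tricks.html"));
    }
}
```

Note that this must run exactly once per value: passing an already escaped URL through it again would turn &amp; into &amp;amp;.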

Accessing ISitemapUrl interface throws IllegalAccessError

Trying to use the following code snippet in my Kotlin project.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
    .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
wsg.addUrl(url);
wsg.write();

I'm getting the following error at runtime.

Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.redfin.sitemapgenerator.ISitemapUrl from class Main
	at Main.run(Main.kt:20)
	at Main$Companion.main(Main.kt:38)
	at Main.main(Main.kt)

The ISitemapUrl interface is not public, which causes the problem when I call wsg.addUrl(url).

Solution:

  • Changing the scope of the ISitemapUrl interface from default to public fixes the issue.

Missing documentation

First of all, thank you for creating sitemapgen4j; it is really appreciated.
Sorry if I've got something wrong in this post. I couldn't find a full user guide or docs for sitemapgen4j, so I don't know what ChangeFreq is used for. Can someone explain it to me?

Thank you in advance. Also, please point me to the documentation for this project if it already exists.

Add A License

An explicit license should be added to the repo.
