As the owner and architect for the big data products at Pitney Bowes, I’m always trying to make sure that the products we build solve real problems for you, our customers. Version 3.1 of our geocoder for Hadoop has some nice additions based on valuable feedback from our users. In past versions of the product, we had a Spark driver application that was essentially a black box: it took a CSV file and some preferences as input and produced the same CSV with additional columns from the geocoder. This was a great place to start, but we quickly realized we needed to do more. We were hearing comments like:
- What if my data isn't in a CSV (e.g. Parquet, ORC, HBase)?
- How do I do multiple geocoding passes?
- How can I add my own business logic to the driver (e.g. my own score that represents the quality of the geocode)?
It became obvious that you wanted more flexibility in how you could use our geocoding engine, starting with a Spark API that allows you to integrate the geocoder into your own Spark jobs. In Spark, data is often represented as a DataFrame or a Dataset. If you have address data in this type of structure, you can now use our Spark API to geocode and standardize it. If you have additional processing (multiple passes or some other business logic), you can easily integrate that into your Spark job. Additionally, this same API can be used with other Spark deployments like Databricks.
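To make the "multiple passes" and "my own quality score" ideas concrete, here is a minimal sketch of that pattern in plain Python (no Spark dependency, so it stays self-contained). The `geocode_street_level` and `geocode_postal_level` functions are hypothetical stand-ins for calls into a real geocoding engine, and the scoring scheme is invented for illustration:

```python
def geocode_street_level(address):
    # Hypothetical precise pass: pretend it only matches full street addresses.
    if "Main St" in address:
        return {"lat": 40.7128, "lon": -74.0060, "precision": "street"}
    return None

def geocode_postal_level(address):
    # Hypothetical coarser fallback pass: pretend it matches on postal code only.
    if "10001" in address:
        return {"lat": 40.7506, "lon": -73.9972, "precision": "postal"}
    return None

# Custom business logic: a made-up score representing geocode quality.
PASS_SCORES = {"street": 1.0, "postal": 0.5}

def geocode_with_fallback(address):
    """Run passes in order of decreasing precision, keep the first hit,
    and attach a custom quality score for downstream filtering."""
    for geocode_pass in (geocode_street_level, geocode_postal_level):
        result = geocode_pass(address)
        if result is not None:
            result["quality"] = PASS_SCORES[result["precision"]]
            return result
    return {"lat": None, "lon": None, "precision": "none", "quality": 0.0}
```

In an actual Spark job, a per-record function like `geocode_with_fallback` would typically be applied to a DataFrame column via a UDF, or to a Dataset with `map`, so the fallback logic and scoring run inside your own pipeline rather than in a fixed driver.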
If you are just getting started with Spark, our new driver application demonstrates how to use the Spark API. It does everything our previous driver did, but for users who want to customize it, we include the source code and a Gradle build. This makes it easy to change the business logic, rebuild, and re-deploy. That way, the product gives you both a good out-of-the-box experience and the customization needed for real-world use cases.
Learn more about our big data products at: http://support.pb.com/help/hadoop/landingpage/index.html
…or post a question to the Community.