

Sift uses multiple machine learning models to identify different kinds of fraud. We have a complex and thorough batch-training process that looks at historical data to build the models periodically. In addition to the model training process, we also have a collection of batch jobs for reporting, billing, experimenting, and so on. These batch jobs are mainly MapReduce jobs, and are I/O intensive. In order to avoid hitting the live HBase (which was also serving the online system, as shown above), we took daily snapshots of HBase tables and copied the snapshot HFiles to Amazon’s Simple Storage Service (S3) in an incremental manner. The long-running, I/O-intensive batch jobs were set up to use the snapshot HFiles in S3 as input.
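
The incremental snapshot copy can be illustrated with a small sketch. The post does not describe the actual mechanism, so this is only one plausible reading: because HFiles are immutable once written, a daily run only needs to send files that are not already in the destination. The `HFileRef` record, the `plan_incremental_copy` function, and the example paths are hypothetical names introduced for illustration, and the listing of what is already in S3 is assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HFileRef:
    """Hypothetical record for one HFile referenced by a snapshot."""
    path: str        # path relative to the table's root directory
    size_bytes: int  # file size, used as a cheap sanity check


def plan_incremental_copy(
    snapshot_files: Iterable[HFileRef],
    already_copied: Iterable[HFileRef],
) -> List[HFileRef]:
    """Return the snapshot files that are not yet in the destination.

    Assumes HFiles are immutable once written, so a file with the same
    path and size as one already copied does not need to be sent again.
    """
    present = {(f.path, f.size_bytes) for f in already_copied}
    return [f for f in snapshot_files if (f.path, f.size_bytes) not in present]


# Day 1 copied two files; day 2's snapshot still references one of them.
day1 = [HFileRef("events/r1/cf/a", 100), HFileRef("events/r1/cf/b", 200)]
day2 = [HFileRef("events/r1/cf/b", 200), HFileRef("events/r1/cf/c", 350)]
to_copy = plan_incremental_copy(day2, day1)
print([f.path for f in to_copy])  # ['events/r1/cf/c']
```

In this sketch only the new file is scheduled for transfer on the second day, which is what keeps a daily copy cheap when most of a table's files are unchanged between snapshots.
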
#Hbase archive cleaner Offline
The Sift platform was hosted in the AWS cloud, powered by Java web applications. We used Apache HBase as our main datastore, primarily for its strong consistency model and its tight coupling with the Apache Hadoop/MapReduce ecosystem. We provisioned and hosted HBase ourselves on Amazon Elastic Compute Cloud (EC2) instances. Over the years, we also built numerous tools to manage HBase (merge, split, and compact regions, do cluster replication, snapshot, backup, backfill, monitor, and failover). In addition to HBase, we had other smaller datastores like Elasticsearch, Amazon RDS, MongoDB, and so on, all hosted in AWS, and performing critical tasks within the Sift platform. A simplified view of Sift’s online and offline systems looked like this in 2018:
#Hbase archive cleaner series
We recently migrated our cloud infrastructure from Amazon Web Services (AWS) to the Google Cloud Platform (GCP) to meet our growing business needs, and we would like to share our migration experience with the engineering community. The target audience of this four-part series (part-2, part-3, part-4) is technology enthusiasts and engineering teams looking to do similar cross-cloud migrations.

At Sift, we have built a scalable and highly available technology platform to enhance trust and safety in the digital world. We ingest, process, and store petabytes of data from our customers from around the globe. When our customers query us about the fraudulent nature of an online event on their systems, we respond to them in real time in the form of a trust score of that event.
