If you are planning on importing more than 20 million rows (or more than 10K requests per minute as part of your import), please contact us at sales@posthog.com so we can make sure you are not rate limited.
Historical data ingestion (or importing data), as opposed to live data ingestion, is the process of transporting data from external sources into PostHog so you can benefit from PostHog product analytics on historical data. You may have historical data that you want to analyze along with new live data, or you may need to periodically import data from third-party sources to augment your live data.
Whatever the reason for the historical data ingestion, this guide covers what to consider during that process.
The three main factors to consider are:
- Data ingestion process: How to get the event data from the third-party source into PostHog
- Importing events: Sending the events captured in the third-party data source into PostHog as custom events
- User identification: How to identify users within PostHog and tie those users back to the users within the original data source
Historical data is sent to PostHog using either a server library or the PostHog API. For more information see the importing events section.
Data ingestion process
Since third-party data sources typically offer an API, you can automate the import of data from one or more sources into PostHog using the PostHog API.
The following factors are important in the export and then import process:
- The volume of data
- API rate limits of both the data source and PostHog
- Exporting only the data required for events
- Handling error scenarios so that the process can resume from the last successful point
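To get a feel for how volume and rate limits interact, the following is a rough planning sketch. The function name `planImport` and all of the numbers are illustrative assumptions, not PostHog limits; substitute the limits that apply to your data source and your PostHog project.

```javascript
// Rough planning arithmetic for an import.
// All inputs are illustrative assumptions, not documented limits.
function planImport({ totalEvents, eventsPerRequest, requestsPerMinute }) {
  // How many requests are needed if each request carries a fixed number of events
  const totalRequests = Math.ceil(totalEvents / eventsPerRequest);
  // Minimum spacing between requests to stay at or under the rate limit
  const delayMs = Math.ceil(60000 / requestsPerMinute);
  // Best-case duration when sending at exactly the rate limit
  const estimatedMinutes = totalRequests / requestsPerMinute;
  return { totalRequests, delayMs, estimatedMinutes };
}

const plan = planImport({
  totalEvents: 5000000,
  eventsPerRequest: 500,
  requestsPerMinute: 1000,
});
console.log(plan);
```

The estimate ignores retries and the export side of the process, so treat it as a lower bound.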
With the above factors in mind, it's recommended that you break the process up into steps such as the following:
1. Sequentially export from the data source only the data that represents key events, keeping track of where you are in the sequence so that you can restart the process from the last successful point if any problems occur.
2. Store the exported data in a separate data store for faster access in later steps.
3. Transform the data into the format you will use with PostHog, and again save it to a storage mechanism for faster access in later steps.

   The data format should be:

       {
         "event": "event_name",
         "distinct_id": "distinct_id_of_your_user",
         "properties": {
           "key1": "value1",
           "key2": "value2"
         },
         "timestamp": "[optional timestamp in ISO 8601 format]"
       }

   At this stage it's also important to consider the following:

   - Use the same event name that you're going to use with your live data ingestion so the historical and live events are seen as the same type within PostHog
   - Use the same unique identifier within the distinct_id field as you use in your live data ingestion so historical events and live events are associated with the same user
   - Convert the old event property names to the new event property names you are using within the events in your live data ingestion
   - Ensure that the timestamp is the original timestamp converted to ISO 8601 format so that PostHog correctly identifies when the original event occurred
   - You may want to set an additional property within properties that identifies the original event within the data source
4. Sequentially import the events into PostHog, keeping track of the last successfully imported event so that you can restart the process from the last successful point if any problems occur.
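The "restart from the last successful point" idea can be sketched as a loop that records a checkpoint only after each event has been sent successfully. This is a minimal sketch, not a PostHog feature: `importWithCheckpoint`, `sendEvent`, `loadCheckpoint`, and `saveCheckpoint` are hypothetical names, and you would supply the sending and checkpoint storage yourself.

```javascript
// Imports events in order and records progress after each success.
// `sendEvent`, `loadCheckpoint`, and `saveCheckpoint` are hypothetical
// callbacks supplied by the caller; none of them are PostHog APIs.
async function importWithCheckpoint(events, { sendEvent, loadCheckpoint, saveCheckpoint }) {
  // Index of the next event to import; 0 when no checkpoint exists yet
  const start = (await loadCheckpoint()) ?? 0;
  for (let i = start; i < events.length; i++) {
    // If this throws, the checkpoint still points at the failed event
    await sendEvent(events[i]);
    // Persist progress only after the send has succeeded
    await saveCheckpoint(i + 1);
  }
  // Number of events imported during this run
  return Math.max(0, events.length - start);
}
```

If a run fails part way through, calling the function again with the same events and the same checkpoint store continues from the event that failed rather than from the beginning. A real import would persist the checkpoint to disk or a database rather than holding it in memory.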
Importing events
Once you are ready to import the data into PostHog you can use one of the following:
- a PostHog server library
- the PostHog API
As mentioned above, the data should be in the following format:
    {
      "event": "event_name",
      "distinct_id": "distinct_id_of_your_user",
      "properties": {
        "key1": "value1",
        "key2": "value2"
      },
      "timestamp": "[optional timestamp in ISO 8601 format]"
    }
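A transformation into this format might look like the following sketch. The source record shape (`id`, `type`, `userId`, `occurredAt` in Unix seconds, `attrs`), the two mapping tables, and the `source_event_id` property are all hypothetical; adapt them to your own data source and to the names used by your live events.

```javascript
// Hypothetical mappings from the old names to the names used by live events
const EVENT_NAMES = { play: 'movie_played' };
const PROPERTY_NAMES = { movieId: 'movie_id', genre: 'category' };

// Converts one exported record into the event format shown above.
// Assumed record shape: { id, type, userId, occurredAt (Unix seconds), attrs }
function toPostHogEvent(record) {
  const properties = {};
  for (const [key, value] of Object.entries(record.attrs)) {
    // Rename known properties; pass unknown ones through unchanged
    properties[PROPERTY_NAMES[key] ?? key] = value;
  }
  // Hypothetical extra property that links back to the original event
  properties.source_event_id = record.id;
  return {
    event: EVENT_NAMES[record.type] ?? record.type,
    distinct_id: String(record.userId),
    properties,
    // Convert the original timestamp to ISO 8601
    timestamp: new Date(record.occurredAt * 1000).toISOString(),
  };
}

const example = toPostHogEvent({
  id: 'evt_1',
  type: 'play',
  userId: 42,
  occurredAt: 1700000000,
  attrs: { movieId: '123', genre: 'romcom' },
});
console.log(JSON.stringify(example, null, 2));
```

Keeping the mappings in plain lookup tables makes it easy to review that every historical event and property name lines up with its live counterpart.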
The server libraries handle batching of capture requests for you. If you decide to use the API directly, you will need to manage batching yourself. For example, using the Node.js library:

    // `client` is an already initialized PostHog Node.js client
    client.capture({
      distinctId: 'distinct_id_of_the_user',
      event: 'movie_played',
      properties: {
        movie_id: '123',
        category: 'romcom'
      }
    })
For more information see the Node.js docs.
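If you use the API directly, managing batching yourself could be sketched as follows. This only builds the request bodies and does not send anything. The payload shape `{ api_key, batch }` is an assumption here, as are the helper names `chunk` and `buildBatchPayloads`; check the PostHog API documentation for the exact endpoint and body before relying on it.

```javascript
// Splits an array into consecutive chunks of at most `size` items
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Wraps each chunk in a request body.
// Assumed payload shape: { api_key, batch }. Verify against the API docs.
function buildBatchPayloads(events, apiKey, batchSize) {
  return chunk(events, batchSize).map((batch) => ({ api_key: apiKey, batch }));
}

const events = [1, 2, 3, 4, 5].map((n) => ({
  event: 'movie_played',
  distinct_id: `user_${n}`,
  properties: { movie_id: String(n) },
}));
const payloads = buildBatchPayloads(events, '<ph_project_api_key>', 2);
console.log(payloads.map((p) => p.batch.length));
```

Each payload can then be sent in sequence, with a pause between requests if you need to stay under a rate limit.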
User identification
As discussed in the data ingestion process section, a unique user identifier (distinct_id) should be set for each event. In addition to setting the user with each event, you can enrich the information about that user by adding more properties:
    client.identify({
      distinctId: 'distinct_id_of_your_user',
      properties: {
        email: 'john@doe.com',
        proUser: false
      }
    })
For more information see the Node.js docs.
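When the user data comes from a historical export, you may want to prepare one identify message per user before sending anything. The sketch below assumes a hypothetical source user record of the form `{ userId, email, plan }`; the helper names and the `source_user_id` property are also illustrative and not part of PostHog.

```javascript
// Builds the argument for an identify call from one source user record.
// Assumed record shape: { userId, email, plan }
function toIdentifyMessage(user) {
  return {
    // Must match the distinct_id used on the imported events
    distinctId: String(user.userId),
    properties: {
      email: user.email,
      proUser: user.plan === 'pro',
      // Hypothetical property tying the user back to the original source
      source_user_id: user.userId,
    },
  };
}

// Keeps one message per distinct ID; later records overwrite earlier ones
function uniqueIdentifyMessages(users) {
  const byId = new Map();
  for (const user of users) {
    byId.set(String(user.userId), toIdentifyMessage(user));
  }
  return [...byId.values()];
}

const messages = uniqueIdentifyMessages([
  { userId: 1, email: 'old@example.com', plan: 'free' },
  { userId: 2, email: 'jane@example.com', plan: 'pro' },
  { userId: 1, email: 'new@example.com', plan: 'pro' },
]);
console.log(messages);
```

Each resulting message can be passed to `client.identify(...)` as in the example above, so that every user is identified once with their most recent properties.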