Unifying Your Server-Side Data: Joining Raw GTM SC Events with GA4 Export in BigQuery for Holistic Analytics
You've successfully built a sophisticated server-side Google Analytics 4 (GA4) pipeline, leveraging Google Tag Manager (GTM) Server Container on Cloud Run to centralize data collection, apply transformations, enrich events, and enforce granular consent. You've even gone further, establishing a raw event data lake in BigQuery to capture every server-side event, and enabling the GA4 BigQuery Export for deep analytics.
However, a critical challenge often emerges when you have these two powerful datasets: how do you bring them together into a single, cohesive view? You have the pristine, untransformed raw events in your data lake, and the GA4-processed, sessionized, and user-attributed data in the GA4 export. Both are invaluable, but operating them in silos means you're missing opportunities for true holistic analysis, custom attribution, and data reconciliation.
The problem is the need to bridge the gap between your raw server-side events and GA4's processed view, enabling you to construct a unified user activity model in BigQuery. This model can then power advanced analytics that go beyond what either dataset offers individually, reconcile discrepancies, and provide a single source of truth for your digital data.
The Challenge: Two Sides of the Same Coin
Let's briefly recap the nature of these two datasets:
1. Raw Server-Side Event Data Lake (e.g., raw_events_data_lake.raw_incoming_events)
- Source: Direct capture from your GTM Server Container before any GA4-specific processing.
- Content: Contains the exact JSON payload your GTM SC received (and potentially basic internal metadata such as `_processed_event_id` and `_request_headers`).
- Advantages:
  - Untransformed: Captures data as close to the source as possible, ideal for auditing, debugging, and schema evolution.
  - Full Ownership: You own every byte, independent of vendor processing rules.
  - Detailed Context: Can include client IP, raw User-Agent, and other HTTP headers if configured.
- Disadvantages:
  - No GA4 Context: Lacks GA4-specific constructs like `session_id`, `user_pseudo_id`, `gclid`/`dclid` parsing, or automatic event groupings.
  - Less Structured: Might require more transformation before direct analysis.
2. GA4 BigQuery Export (e.g., analytics_YOUR_GA4_PROPERTY_ID.events_*)
- Source: Data exported directly from Google Analytics 4.
- Content: Processed and structured according to GA4's schema. Includes GA4's interpretation of sessions, users, and events.
- Advantages:
- GA4 Context: Includes GA4's `user_pseudo_id`, `ga_session_id`, `event_timestamp`, and `event_params` parsed into `string_value`, `int_value`, etc.
- Enriched: Often contains values derived by GA4 (e.g., `gclid` auto-parsing, traffic source dimensions).
- Direct Correlation: Easily map back to reports in the GA4 UI.
- Disadvantages:
- Transformed/Filtered: Data reflects GA4's processing, which may involve sampling, filtering, and schema adaptations that are not always transparent.
- Vendor-Dependent Schema: Tied to GA4's evolving schema.
- Delayed: Data usually arrives daily (or near real-time with the streaming export option), not truly immediate.
The challenge is clear: each dataset is valuable but incomplete for a truly holistic view. You need the granular detail and ownership of your raw events combined with the powerful session and user context that GA4 provides.
Our Solution: A Unified User Activity Data Model in BigQuery
Our solution involves creating a unified user activity data model in BigQuery. This model will join your raw server-side events with the corresponding GA4 exported events, creating a richer, more comprehensive view of user interactions.
By leveraging common identifiers, particularly a consistent event_id (as discussed in Server-Side Event Deduplication), we can precisely link these two datasets.
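To make the linkage concrete, here is a minimal Python sketch of the idea (the real implementation lives in a sandboxed-JavaScript GTM SC template; `tag_event` and the payload layout are hypothetical): one UUID is generated per event and attached to both the raw-lake record and the GA4 Measurement Protocol event.

```python
import uuid

def tag_event(payload: dict) -> tuple[dict, dict]:
    """Attach one shared UUID to both destinations of a server-side event."""
    event_id = str(uuid.uuid4())

    # Record destined for the raw event data lake
    raw_record = {**payload, "_processed_event_id": event_id}

    # Event destined for GA4 via the Measurement Protocol,
    # with the same ID mapped to the `_eid` event parameter
    ga4_event = {
        "name": payload.get("event_name", "page_view"),
        "params": {**payload.get("params", {}), "_eid": event_id},
    }
    return raw_record, ga4_event

raw_rec, ga4_ev = tag_event({"event_name": "purchase", "params": {"value": 42.0}})
```

Because both records carry the same UUID, a later SQL `JOIN` on `processed_event_id = _eid` reunites them deterministically.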
Architecture: Bringing Data Together in BigQuery
The core idea is to use BigQuery's powerful SQL capabilities to join these two streams:
```mermaid
graph TD
    subgraph Google Cloud Data Pipeline
        A[Client-Side Events] -->|HTTP Request| B(GTM Server Container on Cloud Run);
        B -->|1. Raw Event to Ingestion Service| C(Raw Event Ingestion Service on Cloud Run);
        C -->|2. Stream to BigQuery| D[BigQuery Raw Event Data Lake];
        B -->|3. Processed Event to GA4 MP| E[Google Analytics 4];
        E -->|4. Daily/Near Real-time Export| F[BigQuery GA4 Export];
    end
    subgraph BigQuery Unified Layer
        D -- "SQL Join (on event_id, client_id, timestamp)" --> G(BigQuery Unified User Activity Model);
        F -- "SQL Join (on event_id, client_id, timestamp)" --> G;
    end
    G --> H["Looker Studio (Dashboards)"];
    G --> I["BigQuery ML (Advanced Analytics)"];
    G --> J[Custom BI Tools];
```
Key Steps:
1. GTM SC sends raw event to Data Lake: As covered in Building a Server-Side Event Data Lake, every incoming server-side event is captured in its raw form in `raw_events_data_lake.raw_incoming_events`. Crucially, this event includes a `_processed_event_id` (a UUID generated by GTM SC).
2. GTM SC sends processed event to GA4: The GTM SC transforms, enriches, and dispatches the event to GA4 via the Measurement Protocol. This event includes the same `_processed_event_id` mapped to GA4's `_eid` event parameter.
3. GA4 exports data to BigQuery: GA4 processes the event, assigns session and user IDs, and then exports it to your `analytics_YOUR_GA4_PROPERTY_ID.events_*` tables. The `_eid` parameter will be present here.
4. BigQuery joins for a unified model: In BigQuery, you write SQL queries to join the `raw_incoming_events` table with the `events_*` tables, primarily using `_processed_event_id` plus other identifiers for robustness.
Core Components & Implementation Steps
1. Prerequisites Check
Before you start joining, ensure you have:
- Raw Event Data Lake in BigQuery: Your `raw_events_data_lake.raw_incoming_events` table is populated with events, including a `processed_event_id` column.
  - Self-check: Does your `raw_incoming_events` table have a `processed_event_id` (STRING) column populated with the UUID generated by your Universal Event ID Resolver GTM SC template?
- GA4 BigQuery Export Enabled: Your `analytics_YOUR_GA4_PROPERTY_ID.events_*` tables are regularly receiving data from GA4.
  - Self-check: For purchase events, is your `transaction_id` parameter (`event_params.key = 'transaction_id'`) populated? For all events, is the `_eid` parameter (`event_params.key = '_eid'`) populated with the same UUID?
If not, revisit the Server-Side Event Deduplication blog to ensure `_processed_event_id` (mapped to `_eid` for GA4) is consistently generated and used.
2. Identifying Join Keys
The most robust keys for joining are:
- Primary: `_processed_event_id` (from raw data) / `_eid` (from GA4 export). This is the ideal, unique link for a single event instance.
- Secondary/Fallback: `client_id` (from raw data) / `user_pseudo_id` (from GA4 export) links events to the same GA4 user; `event_timestamp` (from both) can be used to find events that occurred very close in time, especially if `_eid` is missing or inconsistent.
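Before committing this logic to SQL, it can help to see the primary/fallback precedence spelled out. Below is a toy in-memory matcher in Python (an illustrative sketch, not part of the pipeline; `match_events` and its field names simply mirror the join keys discussed above):

```python
from datetime import datetime, timedelta

def match_events(raw_event: dict, ga4_events: list[dict], window_sec: int = 60):
    """Return the GA4 event matching a raw event: exact _eid match first,
    then same user plus a timestamp window as a fallback."""
    # Primary: exact _processed_event_id / _eid match
    for ga4 in ga4_events:
        if ga4.get("_eid") and ga4["_eid"] == raw_event.get("_processed_event_id"):
            return ga4
    # Fallback: same user, timestamps within the window
    window = timedelta(seconds=window_sec)
    for ga4 in ga4_events:
        if (ga4.get("user_pseudo_id") == raw_event.get("client_id")
                and abs(ga4["timestamp"] - raw_event["timestamp"]) <= window):
            return ga4
    return None

t0 = datetime(2024, 5, 1, 12, 0, 0)
raw = {"_processed_event_id": "abc", "client_id": "c1", "timestamp": t0}
# GA4 event missing its _eid, but same user and only 30 s apart
ga4_list = [{"_eid": None, "user_pseudo_id": "c1",
             "timestamp": t0 + timedelta(seconds=30)}]
```

Here `match_events(raw, ga4_list)` falls through to the timestamp-window check and still finds the match, which is exactly the robustness the secondary keys buy you.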
3. BigQuery SQL for Unification
Let's create a unified view. We'll extract the _eid from event_params in the GA4 export table and use it to join with our raw events.
```sql
-- Replace with your actual project IDs and GA4 property ID
CREATE OR REPLACE VIEW `your_gcp_project.unified_analytics.unified_user_activity` AS
SELECT
  -- Raw Event Data Fields
  raw.event_timestamp AS raw_event_timestamp,
  raw.event_name AS raw_event_name,
  raw.client_id AS raw_client_id,
  raw.processed_event_id AS unified_event_id,
  JSON_EXTRACT_SCALAR(raw.payload, '$.page_location') AS raw_page_location, -- Example of extracting from JSON payload
  raw.payload AS raw_event_payload, -- Keep the full raw payload for deep dives

  -- GA4 Export Data Fields
  ga4.event_timestamp AS ga4_event_timestamp,
  ga4.event_name AS ga4_event_name,
  ga4.user_pseudo_id AS ga4_user_pseudo_id,
  (SELECT value.int_value FROM UNNEST(ga4.event_params) WHERE key = 'ga_session_id') AS ga4_session_id, -- ga_session_id is an INT in the export
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'page_location') AS ga4_page_location,
  (SELECT value.int_value FROM UNNEST(ga4.event_params) WHERE key = 'engagement_time_msec') AS ga4_engagement_time_msec,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'campaign') AS ga4_campaign,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'source') AS ga4_source,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'medium') AS ga4_medium,
  ga4.event_params, -- Keep the full GA4 event_params array for flexible querying
  ga4.user_properties,
  ga4.traffic_source.name AS ga4_traffic_source_name,
  ga4.geo.country AS ga4_geo_country,
  ga4.device.category AS ga4_device_category
FROM
  `your_gcp_project.raw_events_data_lake.raw_incoming_events` AS raw
LEFT JOIN
  `your_gcp_project.analytics_YOUR_GA4_PROPERTY_ID.events_*` AS ga4
ON
  raw.processed_event_id = (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = '_eid')
  -- Add a timestamp window for robustness, especially if _eid isn't always perfectly consistent
  AND ABS(TIMESTAMP_DIFF(raw.event_timestamp, TIMESTAMP_MICROS(ga4.event_timestamp), SECOND)) < 60 -- Events within 60 seconds
-- Filter for a specific date range, e.g., the last 7 days
WHERE
  DATE(raw.event_timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
  -- Allow a NULL suffix so unmatched raw events survive the LEFT JOIN
  AND (ga4._TABLE_SUFFIX IS NULL
       OR PARSE_DATE('%Y%m%d', ga4._TABLE_SUFFIX)
            BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE());
```
Explanation of the SQL:
- `CREATE OR REPLACE VIEW`: Creates a logical view that acts like a table, but the data is computed each time the view is queried. For performance, you might `CREATE OR REPLACE TABLE` daily instead (materialized-table pattern).
- `your_gcp_project.raw_events_data_lake.raw_incoming_events`: Your raw server-side event table.
- `your_gcp_project.analytics_YOUR_GA4_PROPERTY_ID.events_*`: Your GA4 BigQuery export table. The `*` acts as a wildcard for daily tables.
- `LEFT JOIN`: We start with all raw events and try to find matching GA4 events. This ensures we don't lose any raw event data, even if it didn't make it to GA4 for some reason.
- `ON raw.processed_event_id = (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = '_eid')`: The critical join condition, linking events via their unique server-side generated ID.
- `AND ABS(TIMESTAMP_DIFF(raw.event_timestamp, TIMESTAMP_MICROS(ga4.event_timestamp), SECOND)) < 60`: A secondary check ensuring the raw event and the GA4 event occurred within a 60-second window. This makes the join more robust against `_eid` discrepancies or reuse (though a well-behaved pipeline should never reuse an `_eid` across unique events). `TIMESTAMP_MICROS` is needed because `ga4.event_timestamp` is in microseconds.
- `JSON_EXTRACT_SCALAR(raw.payload, '$.page_location')`: An example of extracting specific fields from the raw payload (stored as `JSON` or `STRING` in your raw data lake). This lets you pull out details that aren't directly available in the GA4 export or that are valuable for comparison.
- `UNNEST(ga4.event_params)`: Essential for querying the nested `event_params` and `user_properties` in the GA4 export.
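If the repeated `(SELECT value.<type> FROM UNNEST(event_params) WHERE key = ...)` pattern is unfamiliar, this small Python analogue shows what each scalar subquery does (a hypothetical `event_param` helper, assuming the export's repeated `key`/`value` record shape):

```python
def event_param(event_params: list[dict], key: str, value_type: str = "string_value"):
    """Python analogue of the SQL pattern
    (SELECT value.<value_type> FROM UNNEST(event_params) WHERE key = <key>):
    scan the repeated key/value records and return the first matching value."""
    for param in event_params:
        if param["key"] == key:
            return param["value"].get(value_type)
    return None  # SQL scalar subqueries likewise yield NULL on no match

params = [
    {"key": "_eid", "value": {"string_value": "abc-123"}},
    {"key": "ga_session_id", "value": {"int_value": 1714567890}},
]
```

With this mental model, `event_param(params, "_eid")` is the Python twin of the `_eid` join key extraction in the view above.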
4. Post-Processing: Custom Attribution & Reconciliation
With your unified_user_activity view or table, you can now:
- Custom Attribution Modeling:
  - Track the `unified_event_id` through your pipeline.
  - Build custom attribution models that consider all touchpoints from your raw data (e.g., specific HTTP referrers, custom API calls) combined with GA4's user and session context.
  - Example: Attribute a conversion not just to the last GA4 `medium`, but to a specific `raw_event_payload` detail that was captured server-side but not sent to GA4.

```sql
-- Example: Multi-touch attribution focusing on the raw referrer for first touch
SELECT
  unified_event_id,
  ga4_user_pseudo_id,
  MIN(raw_event_timestamp) OVER (PARTITION BY ga4_user_pseudo_id) AS first_touch_timestamp,
  FIRST_VALUE(raw_page_location)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY raw_event_timestamp ASC) AS first_touch_page,
  FIRST_VALUE(JSON_EXTRACT_SCALAR(raw_event_payload, '$._request_headers.referer'))
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY raw_event_timestamp ASC) AS first_touch_referrer,
  -- Last-touch attribution: partition by user (not by event) so each
  -- purchase sees the user's final campaign/source
  LAST_VALUE(ga4_campaign)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY ga4_event_timestamp ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_touch_campaign,
  LAST_VALUE(ga4_source)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY ga4_event_timestamp ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_touch_source
FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE ga4_event_name = 'purchase';
```
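Those `FIRST_VALUE`/`LAST_VALUE` window functions reduce to a per-user sort. The same first/last-touch logic in Python (an illustrative `first_last_touch` helper, not part of the pipeline):

```python
def first_last_touch(touchpoints: list[dict]) -> dict:
    """Per-user first- and last-touch attribution: group touchpoints by user,
    sort each user's events by timestamp, and take the two ends."""
    by_user: dict[str, list[dict]] = {}
    for tp in touchpoints:
        by_user.setdefault(tp["user_pseudo_id"], []).append(tp)

    result = {}
    for user, events in by_user.items():
        events.sort(key=lambda e: e["timestamp"])  # mirrors ORDER BY timestamp ASC
        result[user] = {
            "first_touch_source": events[0]["source"],   # FIRST_VALUE(...)
            "last_touch_source": events[-1]["source"],   # LAST_VALUE(...)
        }
    return result

journeys = first_last_touch([
    {"user_pseudo_id": "u1", "timestamp": 1, "source": "newsletter"},
    {"user_pseudo_id": "u1", "timestamp": 3, "source": "google"},
    {"user_pseudo_id": "u1", "timestamp": 2, "source": "direct"},
])
```

The grouping step corresponds to the SQL `PARTITION BY ga4_user_pseudo_id`; the sort corresponds to the window's `ORDER BY`.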
- Data Reconciliation & Debugging:
  - Identify events that were captured raw but not sent to GA4 (where `ga4_event_name` is `NULL`). This could indicate GTM SC misconfigurations, consent blocks, or network issues to GA4.
  - Identify events sent to GA4 that have no raw counterpart (where `raw_event_name` is `NULL`). This might indicate a client-side GA4 implementation not going through your GTM SC, or direct Measurement Protocol hits not captured by your raw logger.
  - Compare `raw_event_name` vs. `ga4_event_name` to see how GTM SC transformations affect event naming.
  - Compare `raw_event_timestamp` vs. `ga4_event_timestamp` for latency analysis.

```sql
-- Events in Raw Data Lake but not in GA4 Export
SELECT * FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE ga4_event_name IS NULL;

-- Events in GA4 Export but no matching raw data (less common with proper SST);
-- note: this requires a FULL OUTER JOIN variant of the view, since the
-- LEFT JOIN above never produces rows with NULL raw fields
SELECT * FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE raw_event_name IS NULL;

-- Compare specific parameters (the view exposes event_params directly)
SELECT
  unified_event_id,
  JSON_EXTRACT_SCALAR(raw_event_payload, '$.value') AS raw_value,
  (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value') AS ga4_value
FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE raw_event_name = 'purchase'
  AND ga4_event_name = 'purchase'
  AND JSON_EXTRACT_SCALAR(raw_event_payload, '$.value')
      != (SELECT CAST(value.double_value AS STRING) FROM UNNEST(event_params) WHERE key = 'value');
```
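These reconciliation queries all boil down to checking which side of the join is populated. A small Python sketch of that classification (a hypothetical `classify` helper operating on rows of the unified model):

```python
def classify(row: dict) -> str:
    """Bucket a unified row by which side of the join is populated,
    mirroring the NULL checks in the reconciliation queries."""
    has_raw = row.get("raw_event_name") is not None
    has_ga4 = row.get("ga4_event_name") is not None
    if has_raw and has_ga4:
        return "matched"
    if has_raw:
        return "raw_only"   # captured server-side, never reached GA4
    return "ga4_only"       # in the GA4 export, missed by the raw logger

rows = [
    {"raw_event_name": "purchase", "ga4_event_name": "purchase"},
    {"raw_event_name": "page_view", "ga4_event_name": None},
    {"raw_event_name": None, "ga4_event_name": "session_start"},
]
```

Tracking the daily counts of these three buckets over time is a cheap health metric for the whole pipeline.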
- Deeper User Journey Analysis:
  - Combine raw HTTP headers (e.g., `User-Agent` from `_request_headers` in `raw_event_payload`) with GA4's session and device data for enhanced bot detection or device segmentation.
  - Trace the exact raw event payload leading up to a specific GA4 conversion.
Benefits of This Unification Approach
- Holistic View of User Activity: Get the best of both worlds – the raw truth of your server-side data combined with GA4's powerful user and session context.
- Enhanced Data Trust: Reconcile discrepancies and debug issues by comparing raw and processed data side-by-side, building confidence in your analytics.
- Custom Attribution Beyond GA4: Build bespoke attribution models tailored to your business, incorporating any data point captured by your server-side pipeline.
- Deeper Insights & Data Science: Fuel advanced analytics, machine learning models, and custom dashboards (e.g., in Looker Studio) with a richer, more complete dataset.
- Future-Proofing: Decouple your core insights from any single vendor's processing by creating your own robust, unified data layer.
- Improved Compliance & Auditability: The raw data serves as an immutable audit trail, now easily linked to user and session context.
Important Considerations
- Cost: Running large JOIN queries in BigQuery can incur significant costs. Ensure your tables are properly partitioned and clustered, and use `CREATE OR REPLACE TABLE` to materialize results rather than querying the `VIEW` directly for frequently accessed data.
- Data Volume: This approach is best for organizations with moderate to high data volumes where the value of detailed insights outweighs the costs.
- Schema Evolution: Your raw data lake should be flexible (e.g., a `JSON` type for the payload) to handle schema changes. GA4's schema also evolves; your `UNNEST` queries need to be robust.
- PII Handling: Be mindful that your raw data lake might contain PII if not pre-scrubbed. Ensure strict access controls and retention policies are in place, as your unified model could expose this PII alongside GA4's pseudo-anonymous IDs.
- Latency: Building these unified tables or views is typically an overnight or hourly batch process, not real-time. The GA4 export itself has a delay.
Conclusion
Having both a raw server-side event data lake and a GA4 BigQuery export is a testament to a mature data engineering strategy. However, the true power is unleashed when these two datasets are intelligently unified. By leveraging BigQuery's SQL capabilities to join your raw GTM Server Container events with GA4's processed data, you gain unparalleled clarity into user behavior, reconcile data discrepancies, and build highly customized attribution models. This holistic approach elevates your analytics, transforms data trust, and empowers your business with a single, comprehensive source of truth for all digital interactions. Embrace this unification to unlock the full potential of your server-side data.