Unifying Your Server-Side Data: Joining Raw GTM SC Events with GA4 Export in BigQuery for Holistic Analytics
You've successfully built a sophisticated server-side Google Analytics 4 (GA4) pipeline, leveraging Google Tag Manager (GTM) Server Container on Cloud Run to centralize data collection, apply transformations, enrich events, and enforce granular consent. You've even gone further, establishing a raw event data lake in BigQuery to capture every server-side event, and enabling the GA4 BigQuery Export for deep analytics.
However, a critical challenge often emerges when you have these two powerful datasets: how do you bring them together into a single, cohesive view? You have the pristine, untransformed raw events in your data lake, and the GA4-processed, sessionized, and user-attributed data in the GA4 export. Both are invaluable, but operating them in silos means you're missing opportunities for true holistic analysis, custom attribution, and data reconciliation.
The problem is the need to bridge the gap between your raw server-side events and GA4's processed view, enabling you to construct a unified user activity model in BigQuery. This model can then power advanced analytics that go beyond what either dataset offers individually, reconcile discrepancies, and provide a single source of truth for your digital data.
The Challenge: Two Sides of the Same Coin
Let's briefly recap the nature of these two datasets:
1. Raw Server-Side Event Data Lake (e.g., raw_events_data_lake.raw_incoming_events)
- Source: Direct capture from your GTM Server Container before any GA4-specific processing.
- Content: Contains the exact JSON payload your GTM SC received (and potentially basic internal metadata such as `_processed_event_id` and `_request_headers`).
- Advantages:
  - Untransformed: Captures data as close to the source as possible, ideal for auditing, debugging, and schema evolution.
  - Full Ownership: You own every byte, independent of vendor processing rules.
  - Detailed Context: Can include client IP, raw User-Agent, and other HTTP headers if configured.
- Disadvantages:
  - No GA4 Context: Lacks GA4-specific constructs like `session_id`, `user_pseudo_id`, `gclid`/`dclid` parsing, or automatic event groupings.
  - Less Structured: Might require more transformation before direct analysis.
2. GA4 BigQuery Export (e.g., analytics_YOUR_GA4_PROPERTY_ID.events_*)
- Source: Data exported directly from Google Analytics 4.
- Content: Processed and structured according to GA4's schema. Includes GA4's interpretation of sessions, users, and events.
- Advantages:
- GA4 Context: Includes GA4's `user_pseudo_id`, `ga_session_id`, `event_timestamp`, and `event_params` parsed into `string_value`, `int_value`, etc.
- Enriched: Often contains values derived by GA4 (e.g., `gclid` auto-parsing, traffic source dimensions).
- Direct Correlation: Easily map back to reports in the GA4 UI.
- Disadvantages:
- Transformed/Filtered: Data reflects GA4's processing, which may involve sampling, filtering, and schema adaptations that are not always transparent.
- Vendor-Dependent Schema: Tied to GA4's evolving schema.
- Delayed: Data usually arrives daily (or near real-time with the streaming export option), not truly immediate.
The challenge is clear: each dataset is valuable but incomplete for a truly holistic view. You need the granular detail and ownership of your raw events combined with the powerful session and user context that GA4 provides.
Our Solution: A Unified User Activity Data Model in BigQuery
Our solution involves creating a unified user activity data model in BigQuery. This model will join your raw server-side events with the corresponding GA4 exported events, creating a richer, more comprehensive view of user interactions.
By leveraging common identifiers, particularly a consistent event_id (as discussed in Server-Side Event Deduplication), we can precisely link these two datasets.
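To make the linkage concrete, here is a minimal Python sketch of the idea (the real implementation lives in a sandboxed-JavaScript GTM SC template; `tag_event` and the payload layout are hypothetical): one UUID is generated per event and attached to both the raw-lake record and the GA4 Measurement Protocol event.

```python
import uuid

def tag_event(payload: dict) -> tuple[dict, dict]:
    """Attach one shared UUID to both destinations of a server-side event."""
    event_id = str(uuid.uuid4())

    # Record destined for the raw event data lake
    raw_record = {**payload, "_processed_event_id": event_id}

    # Event destined for GA4 via the Measurement Protocol,
    # with the same ID mapped to the `_eid` event parameter
    ga4_event = {
        "name": payload.get("event_name", "page_view"),
        "params": {**payload.get("params", {}), "_eid": event_id},
    }
    return raw_record, ga4_event

raw_rec, ga4_ev = tag_event({"event_name": "purchase", "params": {"value": 42.0}})
```

Because both records carry the same UUID, a later SQL `JOIN` on `processed_event_id = _eid` reunites them deterministically.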
Architecture: Bringing Data Together in BigQuery
The core idea is to use BigQuery's powerful SQL capabilities to join these two streams:
```mermaid
graph TD
    subgraph Google Cloud Data Pipeline
        A[Client-Side Events] -->|HTTP Request| B(GTM Server Container on Cloud Run);
        B -->|1. Raw Event to Ingestion Service| C(Raw Event Ingestion Service on Cloud Run);
        C -->|2. Stream to BigQuery| D[BigQuery Raw Event Data Lake];
        B -->|3. Processed Event to GA4 MP| E[Google Analytics 4];
        E -->|4. Daily/Near Real-time Export| F[BigQuery GA4 Export];
    end
    subgraph BigQuery Unified Layer
        D -- "SQL Join (on event_id, client_id, timestamp)" --> G(BigQuery Unified User Activity Model);
        F -- "SQL Join (on event_id, client_id, timestamp)" --> G;
    end
    G --> H["Looker Studio (Dashboards)"];
    G --> I["BigQuery ML (Advanced Analytics)"];
    G --> J[Custom BI Tools];
```
Key Steps:
1. GTM SC sends raw event to Data Lake: As covered in Building a Server-Side Event Data Lake, every incoming server-side event is captured in its raw form in `raw_events_data_lake.raw_incoming_events`. Crucially, this event includes a `_processed_event_id` (a UUID generated by GTM SC).
2. GTM SC sends processed event to GA4: The GTM SC transforms, enriches, and dispatches the event to GA4 via the Measurement Protocol. This event includes the same `_processed_event_id` mapped to GA4's `_eid` event parameter.
3. GA4 exports data to BigQuery: GA4 processes the event, assigns session and user IDs, and then exports it to your `analytics_YOUR_GA4_PROPERTY_ID.events_*` tables. The `_eid` parameter will be present here.
4. BigQuery joins for a unified model: In BigQuery, you write SQL queries to join the `raw_incoming_events` table with the `events_*` tables, primarily using `_processed_event_id` plus other identifiers for robustness.
Core Components & Implementation Steps
1. Prerequisites Check
Before you start joining, ensure you have:
- Raw Event Data Lake in BigQuery: Your `raw_events_data_lake.raw_incoming_events` table is populated with events, including a `processed_event_id` column.
  - Self-check: Does your `raw_incoming_events` table have a `processed_event_id` (STRING) column populated with the UUID generated by your Universal Event ID Resolver GTM SC template?
- GA4 BigQuery Export Enabled: Your `analytics_YOUR_GA4_PROPERTY_ID.events_*` tables are regularly receiving data from GA4.
  - Self-check: For purchase events, is your `transaction_id` parameter (`event_params.key = 'transaction_id'`) populated? For all events, is the `_eid` parameter (`event_params.key = '_eid'`) populated with the same UUID?
If not, revisit the Server-Side Event Deduplication blog to ensure `_processed_event_id` (mapped to `_eid` for GA4) is consistently generated and used.
2. Identifying Join Keys
The most robust keys for joining are:
- Primary: `_processed_event_id` (from raw data) / `_eid` (from GA4 export). This is the ideal, unique link for a single event instance.
- Secondary/Fallback: `client_id` (from raw data) / `user_pseudo_id` (from GA4 export) links events to the same GA4 user; `event_timestamp` (from both) can be used to find events that occurred very close in time, especially if `_eid` is missing or inconsistent.
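Before committing this logic to SQL, it can help to see the primary/fallback precedence spelled out. Below is a toy in-memory matcher in Python (an illustrative sketch, not part of the pipeline; `match_events` and its field names simply mirror the join keys discussed above):

```python
from datetime import datetime, timedelta

def match_events(raw_event: dict, ga4_events: list[dict], window_sec: int = 60):
    """Return the GA4 event matching a raw event: exact _eid match first,
    then same user plus a timestamp window as a fallback."""
    # Primary: exact _processed_event_id / _eid match
    for ga4 in ga4_events:
        if ga4.get("_eid") and ga4["_eid"] == raw_event.get("_processed_event_id"):
            return ga4
    # Fallback: same user, timestamps within the window
    window = timedelta(seconds=window_sec)
    for ga4 in ga4_events:
        if (ga4.get("user_pseudo_id") == raw_event.get("client_id")
                and abs(ga4["timestamp"] - raw_event["timestamp"]) <= window):
            return ga4
    return None

t0 = datetime(2024, 5, 1, 12, 0, 0)
raw = {"_processed_event_id": "abc", "client_id": "c1", "timestamp": t0}
# GA4 event missing its _eid, but same user and only 30 s apart
ga4_list = [{"_eid": None, "user_pseudo_id": "c1",
             "timestamp": t0 + timedelta(seconds=30)}]
```

Here `match_events(raw, ga4_list)` falls through to the timestamp-window check and still finds the match, which is exactly the robustness the secondary keys buy you.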
3. BigQuery SQL for Unification
Let's create a unified view. We'll extract the _eid from event_params in the GA4 export table and use it to join with our raw events.
```sql
-- Replace with your actual project IDs and GA4 property ID
CREATE OR REPLACE VIEW `your_gcp_project.unified_analytics.unified_user_activity` AS
SELECT
  -- Raw Event Data Fields
  raw.event_timestamp AS raw_event_timestamp,
  raw.event_name AS raw_event_name,
  raw.client_id AS raw_client_id,
  raw.processed_event_id AS unified_event_id,
  JSON_EXTRACT_SCALAR(raw.payload, '$.page_location') AS raw_page_location, -- Example of extracting from JSON payload
  raw.payload AS raw_event_payload, -- Keep the full raw payload for deep dives

  -- GA4 Export Data Fields
  ga4.event_timestamp AS ga4_event_timestamp,
  ga4.event_name AS ga4_event_name,
  ga4.user_pseudo_id AS ga4_user_pseudo_id,
  (SELECT value.int_value FROM UNNEST(ga4.event_params) WHERE key = 'ga_session_id') AS ga4_session_id, -- ga_session_id is an INT in the export
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'page_location') AS ga4_page_location,
  (SELECT value.int_value FROM UNNEST(ga4.event_params) WHERE key = 'engagement_time_msec') AS ga4_engagement_time_msec,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'campaign') AS ga4_campaign,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'source') AS ga4_source,
  (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = 'medium') AS ga4_medium,
  ga4.event_params, -- Keep the full GA4 event_params array for flexible querying
  ga4.user_properties,
  ga4.traffic_source.name AS ga4_traffic_source_name,
  ga4.geo.country AS ga4_geo_country,
  ga4.device.category AS ga4_device_category
FROM
  `your_gcp_project.raw_events_data_lake.raw_incoming_events` AS raw
LEFT JOIN
  `your_gcp_project.analytics_YOUR_GA4_PROPERTY_ID.events_*` AS ga4
ON
  raw.processed_event_id = (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = '_eid')
  -- Add a timestamp window for robustness, especially if _eid isn't always perfectly consistent
  AND ABS(TIMESTAMP_DIFF(raw.event_timestamp, TIMESTAMP_MICROS(ga4.event_timestamp), SECOND)) < 60 -- Events within 60 seconds
-- Filter for a specific date range, e.g., the last 7 days
WHERE
  DATE(raw.event_timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
  -- Allow a NULL suffix so unmatched raw events survive the LEFT JOIN
  AND (ga4._TABLE_SUFFIX IS NULL
       OR PARSE_DATE('%Y%m%d', ga4._TABLE_SUFFIX)
            BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE());
```
Explanation of the SQL:
- `CREATE OR REPLACE VIEW`: Creates a logical view that acts like a table, but the data is computed each time the view is queried. For performance, you might `CREATE OR REPLACE TABLE` daily instead (materialized-table pattern).
- `your_gcp_project.raw_events_data_lake.raw_incoming_events`: Your raw server-side event table.
- `your_gcp_project.analytics_YOUR_GA4_PROPERTY_ID.events_*`: Your GA4 BigQuery export table. The `*` acts as a wildcard for daily tables.
- `LEFT JOIN`: We start with all raw events and try to find matching GA4 events. This ensures we don't lose any raw event data, even if it didn't make it to GA4 for some reason.
- `ON raw.processed_event_id = (SELECT value.string_value FROM UNNEST(ga4.event_params) WHERE key = '_eid')`: The critical join condition, linking events via their unique server-side generated ID.
- `AND ABS(TIMESTAMP_DIFF(raw.event_timestamp, TIMESTAMP_MICROS(ga4.event_timestamp), SECOND)) < 60`: A secondary check ensuring the raw event and the GA4 event occurred within a 60-second window. This makes the join more robust against `_eid` discrepancies or reuse (though a well-behaved pipeline should never reuse an `_eid` across unique events). `TIMESTAMP_MICROS` is needed because `ga4.event_timestamp` is in microseconds.
- `JSON_EXTRACT_SCALAR(raw.payload, '$.page_location')`: An example of extracting specific fields from the raw payload (stored as `JSON` or `STRING` in your raw data lake). This lets you pull out details that aren't directly available in the GA4 export or that are valuable for comparison.
- `UNNEST(ga4.event_params)`: Essential for querying the nested `event_params` and `user_properties` in the GA4 export.
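If the repeated `(SELECT value.<type> FROM UNNEST(event_params) WHERE key = ...)` pattern is unfamiliar, this small Python analogue shows what each scalar subquery does (a hypothetical `event_param` helper, assuming the export's repeated `key`/`value` record shape):

```python
def event_param(event_params: list[dict], key: str, value_type: str = "string_value"):
    """Python analogue of the SQL pattern
    (SELECT value.<value_type> FROM UNNEST(event_params) WHERE key = <key>):
    scan the repeated key/value records and return the first matching value."""
    for param in event_params:
        if param["key"] == key:
            return param["value"].get(value_type)
    return None  # SQL scalar subqueries likewise yield NULL on no match

params = [
    {"key": "_eid", "value": {"string_value": "abc-123"}},
    {"key": "ga_session_id", "value": {"int_value": 1714567890}},
]
```

With this mental model, `event_param(params, "_eid")` is the Python twin of the `_eid` join key extraction in the view above.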
4. Post-Processing: Custom Attribution & Reconciliation
With your unified_user_activity view or table, you can now:
- Custom Attribution Modeling:
  - Track the `unified_event_id` through your pipeline.
  - Build custom attribution models that consider all touchpoints from your raw data (e.g., specific HTTP referrers, custom API calls) combined with GA4's user and session context.
  - Example: Attribute a conversion not just to the last GA4 `medium`, but to a specific `raw_event_payload` detail that was captured server-side but not sent to GA4.

```sql
-- Example: Multi-touch attribution focusing on the raw referrer for first touch
SELECT
  unified_event_id,
  ga4_user_pseudo_id,
  MIN(raw_event_timestamp) OVER (PARTITION BY ga4_user_pseudo_id) AS first_touch_timestamp,
  FIRST_VALUE(raw_page_location)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY raw_event_timestamp ASC) AS first_touch_page,
  FIRST_VALUE(JSON_EXTRACT_SCALAR(raw_event_payload, '$._request_headers.referer'))
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY raw_event_timestamp ASC) AS first_touch_referrer,
  -- Last-touch attribution: partition by user (not by event) so each
  -- purchase sees the user's final campaign/source
  LAST_VALUE(ga4_campaign)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY ga4_event_timestamp ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_touch_campaign,
  LAST_VALUE(ga4_source)
    OVER (PARTITION BY ga4_user_pseudo_id ORDER BY ga4_event_timestamp ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_touch_source
FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE ga4_event_name = 'purchase';
```
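Those `FIRST_VALUE`/`LAST_VALUE` window functions reduce to a per-user sort. The same first/last-touch logic in Python (an illustrative `first_last_touch` helper, not part of the pipeline):

```python
def first_last_touch(touchpoints: list[dict]) -> dict:
    """Per-user first- and last-touch attribution: group touchpoints by user,
    sort each user's events by timestamp, and take the two ends."""
    by_user: dict[str, list[dict]] = {}
    for tp in touchpoints:
        by_user.setdefault(tp["user_pseudo_id"], []).append(tp)

    result = {}
    for user, events in by_user.items():
        events.sort(key=lambda e: e["timestamp"])  # mirrors ORDER BY timestamp ASC
        result[user] = {
            "first_touch_source": events[0]["source"],   # FIRST_VALUE(...)
            "last_touch_source": events[-1]["source"],   # LAST_VALUE(...)
        }
    return result

journeys = first_last_touch([
    {"user_pseudo_id": "u1", "timestamp": 1, "source": "newsletter"},
    {"user_pseudo_id": "u1", "timestamp": 3, "source": "google"},
    {"user_pseudo_id": "u1", "timestamp": 2, "source": "direct"},
])
```

The grouping step corresponds to the SQL `PARTITION BY ga4_user_pseudo_id`; the sort corresponds to the window's `ORDER BY`.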
- Data Reconciliation & Debugging:
  - Identify events that were captured raw but not sent to GA4 (where `ga4_event_name` is `NULL`). This could indicate GTM SC misconfigurations, consent blocks, or network issues to GA4.
  - Identify events sent to GA4 that have no raw counterpart (where `raw_event_name` is `NULL`). This might indicate a client-side GA4 implementation not going through your GTM SC, or direct Measurement Protocol hits not captured by your raw logger.
  - Compare `raw_event_name` vs. `ga4_event_name` to see how GTM SC transformations affect event naming.
  - Compare `raw_event_timestamp` vs. `ga4_event_timestamp` for latency analysis.

```sql
-- Events in Raw Data Lake but not in GA4 Export
SELECT * FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE ga4_event_name IS NULL;

-- Events in GA4 Export but no matching raw data (less common with proper SST);
-- note: this requires a FULL OUTER JOIN variant of the view, since the
-- LEFT JOIN above never produces rows with NULL raw fields
SELECT * FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE raw_event_name IS NULL;

-- Compare specific parameters (the view exposes event_params directly)
SELECT
  unified_event_id,
  JSON_EXTRACT_SCALAR(raw_event_payload, '$.value') AS raw_value,
  (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value') AS ga4_value
FROM `your_gcp_project.unified_analytics.unified_user_activity`
WHERE raw_event_name = 'purchase'
  AND ga4_event_name = 'purchase'
  AND JSON_EXTRACT_SCALAR(raw_event_payload, '$.value')
      != (SELECT CAST(value.double_value AS STRING) FROM UNNEST(event_params) WHERE key = 'value');
```
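These reconciliation queries all boil down to checking which side of the join is populated. A small Python sketch of that classification (a hypothetical `classify` helper operating on rows of the unified model):

```python
def classify(row: dict) -> str:
    """Bucket a unified row by which side of the join is populated,
    mirroring the NULL checks in the reconciliation queries."""
    has_raw = row.get("raw_event_name") is not None
    has_ga4 = row.get("ga4_event_name") is not None
    if has_raw and has_ga4:
        return "matched"
    if has_raw:
        return "raw_only"   # captured server-side, never reached GA4
    return "ga4_only"       # in the GA4 export, missed by the raw logger

rows = [
    {"raw_event_name": "purchase", "ga4_event_name": "purchase"},
    {"raw_event_name": "page_view", "ga4_event_name": None},
    {"raw_event_name": None, "ga4_event_name": "session_start"},
]
```

Tracking the daily counts of these three buckets over time is a cheap health metric for the whole pipeline.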
- Deeper User Journey Analysis:
  - Combine raw HTTP headers (e.g., `User-Agent` from `_request_headers` in `raw_event_payload`) with GA4's session and device data for enhanced bot detection or device segmentation.
  - Trace the exact raw event payload leading up to a specific GA4 conversion.
Benefits of This Unification Approach
- Holistic View of User Activity: Get the best of both worlds – the raw truth of your server-side data combined with GA4's powerful user and session context.
- Enhanced Data Trust: Reconcile discrepancies and debug issues by comparing raw and processed data side-by-side, building confidence in your analytics.
- Custom Attribution Beyond GA4: Build bespoke attribution models tailored to your business, incorporating any data point captured by your server-side pipeline.
- Deeper Insights & Data Science: Fuel advanced analytics, machine learning models, and custom dashboards (e.g., in Looker Studio) with a richer, more complete dataset.
- Future-Proofing: Decouple your core insights from any single vendor's processing by creating your own robust, unified data layer.
- Improved Compliance & Auditability: The raw data serves as an immutable audit trail, now easily linked to user and session context.
Important Considerations
- Cost: Running large JOIN queries in BigQuery can incur significant costs. Ensure your tables are properly partitioned and clustered, and use `CREATE OR REPLACE TABLE` to materialize results rather than querying the `VIEW` directly for frequently accessed data.
- Data Volume: This approach is best for organizations with moderate to high data volumes where the value of detailed insights outweighs the costs.
- Schema Evolution: Your raw data lake should be flexible (e.g., a `JSON` type for the payload) to handle schema changes. GA4's schema also evolves; your `UNNEST` queries need to be robust.
- PII Handling: Be mindful that your raw data lake might contain PII if not pre-scrubbed. Ensure strict access controls and retention policies are in place, as your unified model could expose this PII alongside GA4's pseudo-anonymous IDs.
- Latency: Building these unified tables or views is typically an overnight or hourly batch process, not real-time. The GA4 export itself has a delay.
Conclusion
Having both a raw server-side event data lake and a GA4 BigQuery export is a testament to a mature data engineering strategy. However, the true power is unleashed when these two datasets are intelligently unified. By leveraging BigQuery's SQL capabilities to join your raw GTM Server Container events with GA4's processed data, you gain unparalleled clarity into user behavior, reconcile data discrepancies, and build highly customized attribution models. This holistic approach elevates your analytics, transforms data trust, and empowers your business with a single, comprehensive source of truth for all digital interactions. Embrace this unification to unlock the full potential of your server-side data.