SDK Developer Guide

Overview

The Geodesix SDK is a software package that enables content suppliers and content consumers to integrate with Geodesix data services.

Currently, the SDK is intended for use by content suppliers to push data into the Geodesix Data Queue. Once ingested, this data is processed and later made available for search through the Geodesix RAG system.

Data processing is performed asynchronously by the backend processing pipeline. Using the SDK only places the data into the Data Queue; it does not trigger or guarantee immediate processing.

The processing stage depends on completing all onboarding phases with the customer. This does not pose an issue: as long as Geodesix R&D has verified that data is being pushed correctly, the customer may continue sending data to the queue, even if additional processing tasks, such as developing custom parsers, are still required.

Content suppliers can track the status of their data at each stage of the processing pipeline through the Customer/Content Management System (CMS) Portal.

Note: The CMS Portal is currently under development.

A web version of this document is available at: https://geodesix.atlassian.net/wiki/spaces/gsxkw/pages/135299073/SDK+Developer+Guide+-+Python

Content Suppliers - Getting started

Prerequisites

Please complete the following two steps:

  • Accessing the Python SDK code
  • Getting the required SDK settings

Accessing the Python SDK Code

Access to the SDK code is provided through a private GitHub repository, available only to members of the Geodesix SDK Access team. To join this team, the supplier must appoint a developer and provide Geodesix with the developer's GitHub account ID or email address so an invitation can be issued.

  1. Appoint a Python developer to handle the SDK integration task.
  2. Ask the developer to send their GitHub ID or email address (if a GitHub account has not yet been created) to the Geodesix account manager.
  3. The developer should wait for the invitation to join the Geodesix SDK Access team and accept it once it arrives.
  4. After accepting the invitation, the developer will gain access to the SDK repository on GitHub: git@github.com:geodesix-io/geodesix-sdk-python.git

Once access is granted, the developer can proceed with the installation process.

Getting the Required SDK Settings

Before you can use the SDK, your Geodesix account manager must provide you with two pieces of data:

  • Repository name - this is the name of the folder in the queue where your data will be stored. Examples: "company1", "mybusiness" etc.
  • AWS API credentials that include:
    • Access Key (will be set later in GSX_SDK_ACCESS_KEY_ID)
    • Secret key (will be set later in GSX_SDK_SECRET_ACCESS_KEY)

Preparing your development environment

This guide assumes that the developer is familiar with Python programming, understands how to use venv and pip, and is working with a modern development environment.

  1. Python 3.11 or higher must be installed on your machine.
  2. A Git client must be installed.
  3. You will need an IDE that supports Python development.
  4. In your IDE, set the Python interpreter to Python 3.11 or higher.

Installation Procedure

The procedure described below assumes that:

  • A developer has joined the SDK Access team.
  • The SDK settings have been received.
  • The Python development environment is ready.

The Geodesix SDK is provided as a Python code project. To work with the SDK, you must download the code and place it in a directory within the project where you plan to implement the integration.

Downloading the SDK code

Use Git to download the code:

git clone git@github.com:geodesix-io/geodesix-sdk-python.git

Installing Dependencies

Create a virtual environment and install the dependencies. The Geodesix SDK requires the AWS API for Python; running pip from the SDK root, as shown below, installs it together with the SDK.

cd geodesix-sdk-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Setting up the SDK's AWS credentials

The most important rule when working with SDK credentials is to keep them private and never store them in a source-control repository (e.g., Git, SVN, VSS).

The Geodesix SDK expects AWS credentials to be defined either as environment variables or in a private Python configuration file. When the SDK loads, it searches for credentials in the following order:

  1. Environment variables, if they are defined
  2. Constants in private/local_config.py, if environment variables are not found
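
The lookup order above can be sketched as follows. This is a minimal illustration of the "environment first, private file second" rule, not the SDK's own loader; the function name lookup_setting and its config_path parameter are hypothetical.

```python
import importlib.util
import os
from pathlib import Path


def lookup_setting(name, config_path=Path("private/local_config.py")):
    """Return `name` from the environment, else from the config file, else None."""
    # 1. An environment variable takes priority when it is defined and non-empty.
    value = os.environ.get(name)
    if value:
        return value

    # 2. Otherwise fall back to a constant in the private configuration file.
    config_path = Path(config_path)
    if not config_path.is_file():
        return None
    spec = importlib.util.spec_from_file_location("local_config", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, name, None)
```

A helper like this can be handy for a quick preflight check in your own application, for example to report which of the required settings are still missing before the SDK is used.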

For most developers getting started with the SDK, the simplest setup is to create private/local_config.py in the SDK root and place the required values there. For production or existing applications, environment variables or a secret manager are usually the better choice.

Below are two recommended approaches for defining credentials so the SDK can locate them correctly.

Approach 1: Store credentials in a secret manager and set environment variables

If you are integrating the Geodesix SDK into an existing project, you may already be using a secret-management service such as AWS Secrets Manager or Google Cloud Secret Manager. In that case, you can store the SDK credentials in your existing secret storage. Your application then needs to read the credentials from the secret manager and set the following environment variables:

  • GSX_SDK_ACCESS_KEY_ID - set to the access key provided by Geodesix
  • GSX_SDK_SECRET_ACCESS_KEY - set to the secret key provided by Geodesix

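The following pseudocode illustrates this. Here, secret_data stands for the key-value data your application has already read from the secret manager: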

import os

# Get the values from the secret-manager
access_key = secret_data["GSX_SDK_ACCESS_KEY_ID"]
secret_key = secret_data["GSX_SDK_SECRET_ACCESS_KEY"]

# Set two environment variables
os.environ["GSX_SDK_ACCESS_KEY_ID"] = access_key
os.environ["GSX_SDK_SECRET_ACCESS_KEY"] = secret_key
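
For a more concrete picture, here is one possible sketch for AWS Secrets Manager. It assumes the two values are stored together as a single JSON secret; the secret name "geodesix/sdk" and the function name export_gsx_credentials are hypothetical. The function accepts any client object that offers get_secret_value(SecretId=...), which is the call a boto3 Secrets Manager client provides.

```python
import json
import os


def export_gsx_credentials(secrets_client, secret_id):
    """Read a JSON secret and export the two variables the SDK looks for."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret_data = json.loads(response["SecretString"])
    for name in ("GSX_SDK_ACCESS_KEY_ID", "GSX_SDK_SECRET_ACCESS_KEY"):
        os.environ[name] = secret_data[name]


# Usage with boto3 (needs AWS access, so it is left commented out here):
# import boto3
# export_gsx_credentials(boto3.client("secretsmanager"), "geodesix/sdk")
```

Because the SDK searches for credentials when it loads, it is probably safest to call a function like this before importing anything from the geodesix package.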

Approach 2: Define a private file that is not included in the code repository

  1. Navigate to the root directory of the SDK project.
  2. Create a directory named private if it does not already exist.
  3. Copy geodesix/samples/local_config.example.py to private/local_config.py.
  4. Ensure that this file is excluded from your version-control system.
  5. If you are using Git, the SDK's .gitignore file is already configured to ignore the entire private/ directory.
  6. The SDK automatically attempts to load private/local_config.py from the SDK root, so you do not need to manually import it in your application code.

Example of creating a directory and configuration file using the CLI:

cd geodesix-sdk-python
mkdir -p private
cp geodesix/samples/local_config.example.py private/local_config.py

Code for private/local_config.py

GSX_SDK_ACCESS_KEY_ID = "[REPLACE WITH ACCESS KEY]"
GSX_SDK_SECRET_ACCESS_KEY = "[REPLACE WITH SECRET KEY]"
REPOSITORY_NAME = "[REPLACE WITH REPOSITORY NAME]"

Check the Connection with the AWS S3 Service

  1. Navigate to the root directory of the SDK project.
  2. Enter the code_example directory.
  3. Open the file Check_Access.py in your IDE.
  4. Set the repository name you received from Geodesix. You can do this either in private/local_config.py using REPOSITORY_NAME, or by setting an environment variable with the same name.
  5. Run the example using either your IDE or the command line.

For local onboarding, the easiest procedure is usually:

  1. Create private/local_config.py from geodesix/samples/local_config.example.py
  2. Set REPOSITORY_NAME, GSX_SDK_ACCESS_KEY_ID, and GSX_SDK_SECRET_ACCESS_KEY
  3. Run python3 code_example/Check_Access.py from the SDK root

If everything is configured correctly, the program will output "CONNECTION OK".

If any configuration is missing or incorrect, the program will output "CONNECTION PROBLEM" followed by an error trace.

Example running with CLI:

python3 code_example/Check_Access.py
CONNECTION OK
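
If you want to run this check from a deployment or CI script, the printed result can be inspected programmatically. The sketch below relies only on the two messages described above; the helper names connection_ok and run_access_check are hypothetical and not part of the SDK.

```python
import subprocess
import sys


def connection_ok(output):
    """Interpret the text printed by Check_Access.py."""
    return "CONNECTION OK" in output and "CONNECTION PROBLEM" not in output


def run_access_check(script="code_example/Check_Access.py"):
    """Run the check script with the current interpreter and report success."""
    result = subprocess.run(
        [sys.executable, script], capture_output=True, text=True
    )
    return connection_ok(result.stdout + result.stderr)
```

Call run_access_check() from the SDK root so that the relative script path resolves.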

Example code after setting the repository name:

from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils

try:
    # Read the repository name you received from Geodesix
    # (REPOSITORY_NAME in the environment or in private/local_config.py)
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)

    # Create a basic client and test the connection with the AWS service.
    gsxClient = Geodesix.create_basic_data_queue_client(repository)
    gsxClient.test_aws_connection()
except Exception as e:
    print(e, end="")

Push HTML Data

This section and the following one walk through a few code examples for pushing HTML data. The examples assume that all the preparation steps described in the previous sections have been completed.

Most of the code is self-explanatory, but the important points are highlighted after the example.

Source code is in file: [sdk-root]/code_example/Example_PushHtml.py

import json

from geodesix.Consts import Consts
from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils

try:
    # Read the repository name you received from Geodesix
    # (REPOSITORY_NAME in the environment or in private/local_config.py)
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)

    # The URL of the web page
    url = "https://example.com/story14/about/product"

    # The full HTML string of the page
    html = f"<html><body>Example with clean url '{url}'</body></html>"

    # Create the SDK client for pushing data to the Geodesix queue
    gsxClient = Geodesix.create_basic_data_queue_client(repository)

    # Write the page data
    uid = gsxClient.push_html(url, html)

    # Read the stored page back by its uid (for testing purposes only)
    data = gsxClient.read_item_by_uid(uid, "example.com", Consts.DATA_TYPE_HTML)
    print("\nData that was written:", end="")
    print("\n" + json.dumps(data, indent=2), end="")

    # Read the page data by the page's URL (for testing purposes only)
    data = gsxClient.read_html_by_url(url, Consts.DATA_TYPE_HTML)
    print("\nRead result:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
except Exception as e:
    print(e, end="")

Highlights for this example:

  • To use the SDK, import the required classes from the geodesix package. Additional modules may appear in this package; some are intended for other features or internal system use.
  • The SDK client used for pushing data is initialized with the repository name provided by Geodesix. Your account credentials grant access only to the queue folder within this repository. Using any other repository name will cause the SDK to fail.
  • The HTML and URL values must be supplied by your CMS. The SDK does not scrape content; it expects raw data provided directly by the supplier.
  • URL values must represent the clean, exact path of the web resource, without additional query parameters. The SDK does not use the URL verbatim; instead, it parses the URL and reconstructs a normalized URI containing only the required components. As a best practice, provide clean URLs from the start.
  • The HTML value must contain valid and full HTML syntax. Geodesix parses the HTML to extract meaningful text blocks, titles, descriptions, summaries, and more. You may remove unused scripts or style elements to reduce file size and minimize the amount of data processed by Geodesix.
  • The read sections in the code examples are not required for real integrations; they are included only for testing purposes.
  • Since you will typically process multiple files, there is no need to recreate the client each time. Create the client once and reuse it for all subsequent operations.

Here is a simplified version of the code, with comments and the read section removed:

from geodesix.Geodesix import Geodesix

url = "https://example.com/story14/about/product"
html = f"<html><body>Example with clean url '{url}'</body></html>"
gsxClient = Geodesix.create_basic_data_queue_client("[repository name]")
gsxClient.push_html(url, html)
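
Since one client can serve many pages, a batch push could be sketched along the following lines. The helper push_pages is hypothetical and not part of the SDK; it only assumes that the client exposes push_html(url, html) and returns a uid, as in the examples above.

```python
def push_pages(client, pages):
    """Push (url, html) pairs through one client; return the uid of each page."""
    uids = {}
    for url, html in pages:
        # The same client instance is reused for every page.
        uids[url] = client.push_html(url, html)
    return uids


# Usage with the SDK (needs valid credentials, so it is left commented out here):
# from geodesix.Geodesix import Geodesix
# client = Geodesix.create_basic_data_queue_client("[repository name]")
# uids = push_pages(client, [(url_1, html_1), (url_2, html_2)])
```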

Push HTML Data when the resource is identified by query parameter

In the previous example, the URL identified the web page resource using only the URL path. In this example, you will slightly modify the code to support a resource that is identified by a query parameter named id.

Providing the exact URL is critical because Geodesix uses it to generate the universally unique identifier (UUID) for the content item in the system.

We also recommend reviewing the examples in code_example/Example_TestUrl.py, which demonstrate how Geodesix cleans and normalizes URLs (when necessary) to retain only the parts that uniquely identify the resource.
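
To build intuition for why the identifying parts of the URL matter, here is a rough sketch of the idea using only the Python standard library. It is not the SDK's normalization or its UUID scheme, both of which may differ; the function names normalize_url and illustrative_uid are hypothetical, and uuid5 is used simply as one well-known way to derive a stable identifier from a string.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def normalize_url(url, query_param_id=None):
    """Keep scheme, host and path, plus the one query parameter naming the resource."""
    parts = urlsplit(url)
    query = ""
    if query_param_id:
        values = parse_qs(parts.query).get(query_param_id)
        if values:
            query = urlencode({query_param_id: values[0]})
    # The fragment and all other query parameters are dropped.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


def illustrative_uid(url, query_param_id=None):
    """Derive a stable identifier from the normalized URL (illustration only)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalize_url(url, query_param_id)))
```

In this sketch, two URLs that differ only in tracking parameters map to the same identifier, while URLs with different id values map to different ones. This is the behavior the identifying query parameter is meant to guarantee.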

Pay attention to the code in the URL assignment, the push_html(...) call, and the corresponding read operation.

In the URL assignment, the URL uses a query parameter to identify the resource:

url = "https://example.com/story14/about/product?id=19"

The fourth argument of push_html(...) and the third argument of read_html_by_url(...) specify the name of the query parameter that identifies the resource.

Source code can be found in file: [sdk-root]/code_example/Example_PushHtmlQuery.py

import json

from geodesix.Consts import Consts
from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils

try:
    # Read the repository name you received from Geodesix
    # (REPOSITORY_NAME in the environment or in private/local_config.py)
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)

    # Create the SDK client for pushing data to the Geodesix queue
    gsxClient = Geodesix.create_basic_data_queue_client(repository)

    # The URL of the web page. Note the query string part id=19.
    # In this example, 'id' is the query parameter that identifies the resource.
    url = "https://example.com/story14/about/product?id=19"

    # The full HTML string of the page
    html = f"<html><body>Example with query parameter 'id' in url: '{url}' </body></html>"

    # Write the page data
    #
    # Note the fourth argument (queryParamId): it takes the name of the
    # query parameter that identifies the resource.
    uid = gsxClient.push_html(url, html, None, "id")

    # Read the stored page back by its uid (for testing purposes only)
    data = gsxClient.read_item_by_uid(uid, "example.com", Consts.DATA_TYPE_HTML)
    print("\nData that was written:", end="")
    print("\n" + json.dumps(data, indent=2), end="")

    # Read the page data by the page's URL (for testing purposes only)
    #
    # Note the third argument (queryParamId): the same parameter name is used here.
    data = gsxClient.read_html_by_url(url, Consts.DATA_TYPE_HTML, "id")
    print("\nRead result:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
except Exception as e:
    print(e, end="")

Summary

This initial version of the SDK includes the essential components needed to begin pushing data into Geodesix. Additional examples and capabilities will be added in future releases.