SDK Developer Guide
Overview
The Geodesix SDK is a software package that enables content suppliers and content consumers to integrate with Geodesix data services.
Currently, the SDK is intended for use by content suppliers to push data into the Geodesix Data Queue. Once ingested, this data is processed and later made available for search through the Geodesix RAG system.
Data processing is performed asynchronously by the backend processing pipeline. Using the SDK only places the data into the Data Queue; it does not trigger or guarantee immediate processing.
The processing stage depends on completing all onboarding phases with the customer. This does not pose an issue: as long as Geodesix R&D has verified that data is being pushed correctly, the customer may continue sending data to the queue, even if additional processing tasks, such as developing custom parsers, are still required.
Content suppliers can track the status of their data at each stage of the processing pipeline through the Customer/Content Management System (CMS) Portal.
Note: The CMS portal is currently under development.
A web version of this document is available at: https://geodesix.atlassian.net/wiki/spaces/gsxkw/pages/135299073/SDK+Developer+Guide+-+Python
Content Suppliers - Getting started
Prerequisites
Please complete the following two steps:
- Accessing the Python SDK code
- Getting the required SDK settings
Accessing the Python SDK Code
Access to the SDK code is provided through a private GitHub repository, available only to members of the Geodesix SDK Access team. To join this team, the supplier must appoint a developer and provide Geodesix with the developer's GitHub account ID or email address so an invitation can be issued.
- Appoint a Python developer to handle the SDK integration task.
- Ask the developer to send their GitHub ID or email address (if a GitHub account has not yet been created) to the Geodesix account manager.
- The developer should wait for the invitation to join the Geodesix SDK Access team and accept it once it arrives.
- After accepting the invitation, the developer will gain access to the SDK repository on GitHub:
git@github.com:geodesix-io/geodesix-sdk-python.git
Once access is granted, the developer can proceed with the installation process.
Getting the Required SDK Settings
Before you can use the SDK, your Geodesix account manager must provide you with two pieces of data:
- Repository name - this is the name of the folder in the queue where your data will be stored. Examples: "company1", "mybusiness" etc.
- AWS API credentials that include:
  - Access key (will be set later in GSX_SDK_ACCESS_KEY_ID)
  - Secret key (will be set later in GSX_SDK_SECRET_ACCESS_KEY)
Preparing your development environment
This guide assumes that the developer is familiar with Python programming, understands how to use venv and pip, and is working with a modern development environment.
- To use the SDK, you will need an IDE that supports Python development.
- In your IDE, set the Python interpreter to Python 3.11 or higher.
- Python 3.11 or higher must be installed on your machine.
- Git client must be installed.
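The requirements above can be checked with a short script before installing. This is a minimal sketch that uses only the Python standard library; the function names are illustrative and are not part of the Geodesix SDK.

```python
import shutil
import sys

# Minimum interpreter version stated in this guide.
MINIMUM_PYTHON = (3, 11)


def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True when the interpreter version is at least `minimum`."""
    current = sys.version_info if version_info is None else version_info
    # Compare only (major, minor); the patch level is irrelevant here.
    return tuple(current[:2]) >= tuple(minimum)


def git_is_installed():
    """Return True when a `git` executable is found on PATH."""
    return shutil.which("git") is not None
```

Calling python_is_supported() and git_is_installed() with no arguments checks the interpreter and PATH of the environment the script runs in.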
Installation Procedure
The procedure described below assumes that:
- A developer has joined the SDK Access team.
- The SDK settings have been received.
- The Python development environment is ready.
The Geodesix SDK is provided as a Python code project. To work with the SDK, you must download the code and place it in a directory within the project where you plan to implement the integration.
Downloading the SDK code
Use Git to download the code:
git clone git@github.com:geodesix-io/geodesix-sdk-python.git
Installing Dependencies
Create a virtual environment and install the dependencies. The Geodesix SDK requires the AWS API for Python, which pip installs when you run it from the SDK root directory.
cd geodesix-sdk-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Setting up the SDK's AWS credentials
The most important rule when working with SDK credentials is to keep them private and never store them in a source-control repository (e.g., Git, SVN, VSS).
The Geodesix SDK expects AWS credentials to be defined either as environment variables or in a private Python configuration file. When the SDK loads, it searches for credentials in the following order:
- Environment variables, if they are defined
- Constants in private/local_config.py, if environment variables are not found
For most developers getting started with the SDK, the simplest setup is to create private/local_config.py in the SDK root and place the required values there. For production or existing applications, environment variables or a secret manager are usually the better choice.
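To make the lookup order concrete, the sketch below mimics it with the standard library only. It is an illustration of the described behavior, not the SDK's actual implementation; resolve_setting and its parameters are hypothetical names.

```python
import os
import runpy
from pathlib import Path


def resolve_setting(name, config_path="private/local_config.py", environ=None):
    """Return the value of `name`, preferring environment variables.

    Hypothetical helper: it mirrors the order described in this guide
    (environment first, then constants in the private config file).
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if value:
        return value
    path = Path(config_path)
    if path.is_file():
        # run_path executes the file and returns its module-level names.
        constants = runpy.run_path(str(path))
        if name in constants:
            return constants[name]
    raise KeyError(f"{name} not found in environment or {config_path}")
```

With both sources defined, the environment value wins, which is why a leftover environment variable can mask a value you just edited in private/local_config.py.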
Below are two recommended approaches for defining credentials so the SDK can locate them correctly.
Approach 1: Store credentials in a secret manager and set environment variables
If you are integrating the Geodesix SDK into an existing project, you may already be using a secret-management service such as AWS Secrets Manager or Google Cloud Secret Manager. In that case, you can store the SDK credentials in your existing secret storage. Your application then needs to read the credentials from the secret manager and set the following environment variables:
- GSX_SDK_ACCESS_KEY_ID - set to the access key provided by Geodesix
- GSX_SDK_SECRET_ACCESS_KEY - set to the secret key provided by Geodesix
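The following pseudocode illustrates this; secret_data stands for the key/value data your application has already read from the secret manager: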
import os
# Get the values from the secret-manager
access_key = secret_data["GSX_SDK_ACCESS_KEY_ID"]
secret_key = secret_data["GSX_SDK_SECRET_ACCESS_KEY"]
# Set two environment variables
os.environ["GSX_SDK_ACCESS_KEY_ID"] = access_key
os.environ["GSX_SDK_SECRET_ACCESS_KEY"] = secret_key
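A slightly fuller sketch of the same idea is shown below. It assumes the secret is stored as a JSON object whose keys match the environment variable names; fetch_secret is a hypothetical callable standing in for your secret manager's client call, and export_sdk_credentials is not part of the SDK.

```python
import json
import os

REQUIRED_KEYS = ("GSX_SDK_ACCESS_KEY_ID", "GSX_SDK_SECRET_ACCESS_KEY")


def export_sdk_credentials(fetch_secret, environ=None):
    """Copy the SDK credentials from a secret payload into environment variables.

    `fetch_secret` is a hypothetical zero-argument callable that returns the
    secret as a JSON string; replace it with your secret manager's client call.
    """
    env = os.environ if environ is None else environ
    secret_data = json.loads(fetch_secret())
    # Fail early with a clear message instead of exporting partial credentials.
    missing = [key for key in REQUIRED_KEYS if not secret_data.get(key)]
    if missing:
        raise KeyError("Secret is missing: " + ", ".join(missing))
    for key in REQUIRED_KEYS:
        env[key] = secret_data[key]
```

Because the guide states that credentials are looked up when the SDK loads, it is probably safest to call such a helper before creating the SDK client.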
Approach 2: Define a private file that is not included in the code repository
- Navigate to the root directory of the SDK project.
- Create a directory named private if it does not already exist.
- Copy geodesix/samples/local_config.example.py to private/local_config.py.
- Ensure that this file is excluded from your version-control system.
- If you are using Git, the SDK's .gitignore file is already configured to ignore the entire private/ directory.
- The SDK automatically attempts to load private/local_config.py from the SDK root, so you do not need to manually import it in your application code.
Example of creating a directory and configuration file using the CLI:
cd geodesix-sdk-python
mkdir -p private
cp geodesix/samples/local_config.example.py private/local_config.py
Code for private/local_config.py
GSX_SDK_ACCESS_KEY_ID = "[REPLACE WITH ACCESS KEY]"
GSX_SDK_SECRET_ACCESS_KEY = "[REPLACE WITH SECRET KEY]"
REPOSITORY_NAME = "[REPLACE WITH REPOSITORY NAME]"
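A common slip is to copy the example file and leave one of the placeholders in place. The sketch below reports which constants are missing or still start with "[REPLACE". It is a hypothetical helper built on the standard library, not something shipped with the SDK.

```python
import runpy

REQUIRED_CONSTANTS = (
    "GSX_SDK_ACCESS_KEY_ID",
    "GSX_SDK_SECRET_ACCESS_KEY",
    "REPOSITORY_NAME",
)


def find_unset_constants(config_path="private/local_config.py"):
    """Return the names of required constants that are missing or still placeholders."""
    # run_path executes the config file and returns its module-level names.
    constants = runpy.run_path(config_path)
    unset = []
    for name in REQUIRED_CONSTANTS:
        value = constants.get(name)
        if not isinstance(value, str) or not value or value.startswith("[REPLACE"):
            unset.append(name)
    return unset
```

An empty list means all three constants have been given real values; it does not verify that the values themselves are valid.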
Check the Connection with the AWS S3 Service
- Navigate to the root directory of the SDK project.
- Enter the code_example directory.
- Open the file Check_Access.py in your IDE.
- Set the repository name you received from Geodesix. You can do this either in private/local_config.py using REPOSITORY_NAME, or by setting an environment variable with the same name.
- Run the example using either your IDE or the command line.
For local onboarding, the easiest procedure is usually:
- Create private/local_config.py from geodesix/samples/local_config.example.py
- Set REPOSITORY_NAME, GSX_SDK_ACCESS_KEY_ID, and GSX_SDK_SECRET_ACCESS_KEY
- Run python3 code_example/Check_Access.py from the SDK root
If everything is configured correctly, the program will output "CONNECTION OK".
If any configuration is missing or incorrect, the program will output "CONNECTION PROBLEM" followed by an error trace.
Example running with CLI:
python3 code_example/Check_Access.py
CONNECTION OK
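If you want to run this check automatically, for example in a deployment script, the output can be inspected programmatically. The sketch below is an assumption-laden illustration: connection_ok is a hypothetical helper, and it relies only on the "CONNECTION OK" text described above appearing in the program's output.

```python
import subprocess
import sys


def connection_ok(script_path="code_example/Check_Access.py", timeout=60):
    """Run the check script and report whether it printed CONNECTION OK."""
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Look in both streams, since the guide does not say which one is used.
    return "CONNECTION OK" in (result.stdout + result.stderr)
```

Run it from the SDK root so that the default relative path resolves, using the same virtual environment in which the SDK was installed.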
Example code after setting the repository name:
from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils
try:
    # SET HERE THE REPOSITORY NAME YOU RECEIVED FROM GEODESIX
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)
    # Create a bare client and test the connection with the AWS service.
    gsxClient = Geodesix.create_basic_data_queue_client(repository)
    gsxClient.test_aws_connection()
except Exception as e:
    print(e, end="")
Push HTML Data
In this section and the following ones, we show a few code examples for pushing HTML data. The examples assume that all the preparation steps described in the previous sections have been completed.
Most of the code is self-explanatory, but we highlight the important points to note.
Source code is in file: [sdk-root]/code_example/Example_PushHtml.py
import json
from geodesix.Consts import Consts
from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils
try:
    # SET HERE THE REPOSITORY NAME YOU RECEIVED FROM GEODESIX
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)
    # Given the web page URL
    url = "https://example.com/story14/about/product"
    # Given the HTML string (full html) of a page
    html = f"<html><body>Example with clean url '{url}'</body></html>"
    # Creating the SDK client for pushing data to Geodesix queue
    gsxClient = Geodesix.create_basic_data_queue_client(repository)
    # Write the page data
    uid = gsxClient.push_html(url, html)
    # Read the data by uid of stored page (!for test only purpose!)
    data = gsxClient.read_item_by_uid(uid, "example.com", Consts.DATA_TYPE_HTML)
    print("\nData that was written:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
    # Read page data by page's url (for test only purpose)
    data = gsxClient.read_html_by_url(url, Consts.DATA_TYPE_HTML)
    print("\nRead result:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
except Exception as e:
    print(e, end="")
Highlights for this example:
- To use the SDK, import the required classes from the geodesix package. Additional modules may appear in this package; some are intended for other features or internal system use.
- The SDK client used for pushing data is initialized with the repository name provided by Geodesix. Your account credentials grant access only to the queue folder within this repository. Using any other repository name will cause the SDK to fail.
- The HTML and URL values must be supplied by your CMS. The SDK does not scrape content; it expects raw data provided directly by the supplier.
- URL values must represent the clean, exact path of the web resource, without additional query parameters. The SDK does not use the URL verbatim; instead, it parses the URL and reconstructs a normalized URI containing only the required components. As a best practice, provide clean URLs from the start.
- The HTML value must contain valid and full HTML syntax. Geodesix parses the HTML to extract meaningful text blocks, titles, descriptions, summaries, and more. You may remove unused scripts or style elements to reduce file size and minimize the amount of data processed by Geodesix.
- The read sections in the code examples are not required for real integrations; they are included only for testing purposes.
- Since you will typically process multiple files, there is no need to recreate the client each time. Create the client once and reuse it for all subsequent operations.
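As an illustration of the note about removing scripts and style elements, the sketch below strips them from a page before it is pushed. It is a rough, regex-based example and not part of the SDK; strip_scripts_and_styles is a hypothetical name, and a real HTML parser would be more robust for unusual markup.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> element.
_SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    flags=re.DOTALL | re.IGNORECASE,
)


def strip_scripts_and_styles(html):
    """Remove <script> and <style> elements, leaving the rest of the page intact."""
    return _SCRIPT_OR_STYLE.sub("", html)
```

This sketch removes every script and style element; adjust it if some of them must be kept.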
Here is a simplified version of the code, with comments and the read section removed:
from geodesix.Geodesix import Geodesix
url = "https://example.com/story14/about/product"
html = f"<html><body>Example with clean url '{url}'</body></html>"
gsxClient = Geodesix.create_basic_data_queue_client("[repository name]")
gsxClient.push_html(url, html)
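When pushing many pages, the same client can be reused in a loop. The sketch below assumes only what the examples above show: a client object with a push_html(url, html) method that returns a uid. The push_pages helper and its error handling are illustrative, not part of the SDK.

```python
def push_pages(client, pages):
    """Push (url, html) pairs with one client.

    Returns two dicts: {url: uid} for pushed pages and {url: error message}
    for pages that raised an exception.
    """
    pushed, failed = {}, {}
    for url, html in pages:
        try:
            pushed[url] = client.push_html(url, html)
        except Exception as exc:
            # Keep going so that one bad page does not stop the whole batch.
            failed[url] = str(exc)
    return pushed, failed
```

In a real integration, client would be the object returned by Geodesix.create_basic_data_queue_client(...), created once before the loop.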
Push HTML Data when the resource is identified by query parameter
In the previous example, the URL identified the web page resource using only the URL path. In this example, you will slightly modify the code to support a resource that is identified by a query parameter named id.
Providing the exact URL is critical because Geodesix uses it to generate the universally unique identifier (UUID) for the content item in the system.
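To see why the exact URL matters, consider any scheme that derives an identifier deterministically from a URL. The sketch below uses Python's standard uuid5 function purely as an illustration; it is an assumption for demonstration and is not the algorithm Geodesix uses.

```python
import uuid


def illustrative_uid(url):
    """Derive a deterministic UUID from a URL (illustration only, not the Geodesix algorithm)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


# The same URL always yields the same identifier...
same = illustrative_uid("https://example.com/story14/about/product?id=19")
again = illustrative_uid("https://example.com/story14/about/product?id=19")
# ...while a different resource ID yields a different identifier.
other = illustrative_uid("https://example.com/story14/about/product?id=20")
```

Under any such scheme, pushing the same page with two different URL spellings would create two separate items, while pushing two pages under the same URL would make them collide.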
Later, it is recommended to review the examples in code_example/Example_TestUrl.py, which demonstrate how Geodesix cleans and normalizes URLs (when necessary) to retain only the parts that uniquely identify the resource.
Pay attention to the code in the URL assignment, the push_html(...) call, and the corresponding read operation.
In the URL assignment, the URL uses a query parameter to identify the resource:
url = "https://example.com/story14/about/product?id=19"
The fourth argument of push_html(...) and the third argument of read_html_by_url(...) specify the name of the query parameter that identifies the resource.
Source code can be found in file: [sdk-root]/code_example/Example_PushHtmlQuery.py
import json
from geodesix.Consts import Consts
from geodesix.Geodesix import Geodesix
from geodesix.core.SDKUtils import SDKUtils
try:
    # SET HERE THE REPOSITORY NAME YOU RECEIVED FROM GEODESIX
    repository = SDKUtils.get_constant("REPOSITORY_NAME", True)
    # Creating the SDK client for pushing data to Geodesix queue
    gsxClient = Geodesix.create_basic_data_queue_client(repository)
    # Given the web page URL. NOTE THE QUERY STRING PART id=19.
    # In this example 'id' is a query parameter that defines the resource.
    url = "https://example.com/story14/about/product?id=19"
    # Given the HTML string (full html) of a page
    html = f"<html><body>Example with query parameter 'id' in url: '{url}' </body></html>"
    # Write the page data
    #
    # NOTE THE VALUE OF queryParamId. It gets the name of the query parameter that identifies the resource.
    uid = gsxClient.push_html(url, html, None, "id")
    # Read the data by uid of stored page (!for test only purpose!)
    data = gsxClient.read_item_by_uid(uid, "example.com", Consts.DATA_TYPE_HTML)
    print("\nData that was written:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
    # Read page data by page's url (for test only purpose)
    #
    # NOTE THE VALUE OF queryParamId. We use it here too.
    data = gsxClient.read_html_by_url(url, Consts.DATA_TYPE_HTML, "id")
    print("\nRead result:", end="")
    print("\n" + json.dumps(data, indent=2), end="")
except Exception as e:
    print(e, end="")
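If your CMS produces URLs with tracking or session parameters, you may want to reduce them to the identifying parameter before calling the SDK. The sketch below shows one way to do that with the standard library. It only illustrates the idea of a clean URL; keep_identifying_param is a hypothetical helper and does not reproduce the SDK's own normalization.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_identifying_param(url, param_name=None):
    """Drop the fragment and every query parameter except `param_name`.

    With param_name=None, the whole query string is removed.
    """
    parts = urlsplit(url)
    kept = []
    if param_name is not None:
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key == param_name
        ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

For a page identified by id, keep_identifying_param(url, "id") keeps only that parameter; for a page identified by its path alone, calling it without a parameter name removes the query string entirely.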
Summary
This initial version of the SDK includes the essential components needed to begin pushing data into Geodesix. Additional examples and capabilities will be added in future releases.