NOMN

Speculative Audio Tools

NOMN: Temporal Fine Structure Enhancer

Late 2026 / Patent Pending
FAQ
What is NOMN actually doing to audio?
NOMN adds the temporal microstructure that natural acoustic sources have and that digital playback doesn't. A violinist's bowing, a vocalist's phrasing, a drummer's microtiming, the mechanical drift of any physical instrument: these produce small, structured variations in event timing that the auditory system has co-evolved with for hundreds of thousands of years. The variations enter the auditory system as part of what neuroscience calls **temporal fine structure**, the sub-millisecond waveform information the cochlea passes to the auditory nerve, which the brain uses for pitch perception, source identification, and the perception of naturalness (see the TFS questions below for more detail).

Digital playback runs on a crystal-locked clock whose timing stability is orders of magnitude tighter than any natural acoustic source. Crystals have measurable phase noise and jitter. We're not claiming they don't, but those deviations are vanishingly small and statistically structureless compared to the rich temporal variation any physical sound source produces. There has never, in the natural history of hearing, been a sound source so temporally rigid.

NOMN introduces the kind of variation that natural sources have and grid-locked playback doesn't. Not as random noise, not as a recognizable effect, but as structured temporal patterning that the auditory system reads as natural rather than mechanical.
Isn't this just an advanced tremolo or a fancy chorus?
At the DSP-primitive level, timing modulation has been around forever. Tape wow and flutter have been doing this to analog audio since the 1930s. Granular synthesis has been operating at sub-10ms scales since the 1970s. Every chorus plugin since the 1990s has been doing sub-sample-resolution time-varying modulation. We're not claiming to have invented modulation :)

What's new is what's driving it.

A tremolo is driven by a 2-parameter LFO. A chorus is driven by a 4-6 parameter LFO. A humanizer plugin is driven by filtered random noise. Tape emulation is driven by noise shaped to match measured wow/flutter spectra from vintage gear. All of these control signals are content-blind, and none of them is modeled from the body. They're modeled from nostalgia for vintage gear.

NOMN's modulation is content-adaptive and statistically matched to natural source variation. You don't get that out of an LFO no matter how cleverly you tweak its parameters. The right analogy isn't "advanced tremolo." It's the difference between a sine wave oscillator and a sampled instrument. Both produce periodic audio. One sounds like a synth and one sounds like a violin, because the signal driving them encodes vastly different amounts of natural-source structure. Same primitive, fundamentally different signal.
Music cognition research says the smallest perceptible timing difference is something like 10-50ms. Doesn't that mean NOMN's microsecond-scale modulation is below the just-noticeable-difference (JND) threshold, and thus handwavy audiophile nonsense like wildly expensive speaker cable?
This is the most common version of the audibility critique, and it gets the question backwards.

First, on what the JND literature actually measures. JND (just-noticeable-difference) thresholds for musical timing, the ones in the 10-50ms range, measure how much one note has to move relative to another before a listener can consciously identify the shift in a forced-choice cognitive task. That tells you when timing becomes *labelable* as different. It does not tell you the resolution at which the auditory system processes time or what we sense.

The auditory system's actual temporal resolution is roughly three to four orders of magnitude finer than musical JND. The two most established lines of evidence:

The binaural pathway resolves interaural time differences down to about 10 microseconds. Klumpp & Eady (1956, J. Acoust. Soc. Am. 28: 859-860) measured average ITD discrimination thresholds of 9μs for band-limited noise and 11μs for a 1000-Hz tone across ten listeners. These thresholds have been independently reproduced for nearly seventy years. Brughera, Dunai & Hartmann (2013, J. Acoust. Soc. Am. 133: 2839-2855) confirmed thresholds just above 10μs at 700-1000 Hz using modern methods. The lowest measured thresholds fall in the single-digit microsecond range under optimal conditions. The mechanism is well understood: neurons in the medial superior olive perform coincidence detection on phase-locked spikes from each ear. The largest ITD anyone normally encounters, for a sound directly to one side, is around 600-700μs, set by the distance between the ears. Listeners reliably resolve angular differences of about 1 degree near the midline (Mills 1958, J. Acoust. Soc. Am. 30: 237-246). Note that the foundational measurements here are roughly seventy years old!
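
One way to see how the roughly 1 degree angular acuity and the roughly 10μs ITD threshold line up is Woodworth's rigid-sphere approximation, ITD = (a / c)(θ + sin θ). The sketch below is illustrative only: the head radius of 8.75 cm and the speed of sound of 343 m/s are assumed values, and the function name is ours, not taken from the cited papers.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C
HEAD_RADIUS_M = 0.0875      # assumed: commonly used average adult head radius


def woodworth_itd_us(azimuth_deg, head_radius_m=HEAD_RADIUS_M, c=SPEED_OF_SOUND_M_S):
    """Interaural time difference in microseconds for a distant source.

    Woodworth's rigid-sphere approximation, ITD = (a / c) * (theta + sin(theta)),
    intended for azimuths between 0 (straight ahead) and 90 degrees (to one side).
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + math.sin(theta)) * 1e6


if __name__ == "__main__":
    for azimuth in (1, 5, 30, 90):
        print(f"{azimuth:>2} deg -> {woodworth_itd_us(azimuth):6.1f} us")
```

Under these assumptions, a source 1 degree off the midline produces an ITD of about 9μs and a source at 90 degrees produces about 656μs, which is consistent with the thresholds and the maximum ITD quoted above.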

The monaural pathway encodes the sub-millisecond structure of sounds through what auditory neuroscience calls **temporal fine structure (TFS)**, the rapid waveform oscillations within each cochlear frequency band, as distinct from the slower envelope (ENV) modulations superimposed on them (Moore 2008, J. Assoc. Res. Otolaryngol. 9: 399-406, the canonical review). TFS information is carried in the timing of auditory-nerve-fiber spikes that phase-lock to individual cycles of the stimulus waveform for low-frequency components up to several kilohertz. This isn't a hypothesis or a contested claim; it is the standard model of how the auditory periphery encodes time, reviewed comprehensively in Joris, Schreiner & Rees (2004, Physiological Reviews 84: 541-577).

TFS is what the auditory system uses for pitch perception of complex tones, for the perception of speech in fluctuating background noise, and for source separation in complex acoustic environments. Smith, Delgutte & Oxenham (2002, Nature 416: 87-90) demonstrated this directly by constructing "chimaeric" sounds in which the envelope of one signal was combined with the TFS of another. Listeners reliably perceived pitch and source location based on the TFS, not the envelope. TFS isn't specific to live sound, binaural listening, or any particular playback situation. It operates on whatever the cochlea receives, including the output of headphones and speakers playing recorded music. When you listen to a recording, the temporal fine structure of the audio is encoded into the spike timing of your auditory nerve at sub-millisecond resolution. This processing happens continuously, below the threshold of conscious awareness, which is exactly why musical JND studies don't measure it. JND measures what listeners can report. It doesn't measure what their auditory systems are doing.

The more important point. **The right question isn't whether listeners can A/B-distinguish two audio files in a controlled trial. The right question is whether the technology that generates audio for human consumption should operate at the resolution of the sensory system it's serving.**

The audio industry has answered this question consistently for decades. Studios record at 96kHz or 192kHz not because listeners can reliably A/B-distinguish those rates from 48kHz on every track, but because the production chain shouldn't introduce artifacts at the limit of the system's resolution. Mastering engineers obsess over jitter specifications in word clocks that operate well below classical audibility thresholds, because they don't want the clock to be the bottleneck. Professional audio interfaces compete on sub-millisecond round-trip latency. The principle is consistent: human-facing audio technology should operate above the sensory floor, not below it.

NOMN sits in this lineage. Crystal-locked playback timing is acoustically unprecedented in the natural history of hearing. There has never been a sound source with this little temporal variation. The question isn't whether listeners can articulate the difference in a forced-choice test on a per-track basis. The question is whether AI-generated audio at scale, intended for billions of hours of human listening, should match the temporal resolution the sensory system actually uses. We think it should. The audio industry has historically agreed with that principle for every other dimension of the playback chain: sample rate, bit depth, jitter, latency, frequency response, distortion. Treating the temporal microstructure dimension as the lone exception, just because the relevant variation sits below conscious labelling threshold, is inconsistent.

If the audibility critique held, if anything below conscious JND were perceptually irrelevant, listeners couldn't localize sound sources, couldn't separate voices in a crowd, couldn't tell a real violin from a sampled violin played through the same speaker. All of those judgments depend on temporal resolution far finer than musical JND.
OK so this is all pretty interesting, but what's temporal fine structure, exactly, and where does NOMN sit relative to the established TFS literature?
Temporal fine structure (TFS) is the standard technical term in auditory neuroscience for the rapid sub-millisecond waveform information the cochlea passes to the auditory nerve, as distinct from the slower envelope (ENV) information riding on top of it. The cochlea decomposes broadband sound into narrowband signals via auditory filtering, and each of those narrowband signals can be characterized as a slow-varying envelope superimposed on a faster carrier: the fine structure. Both kinds of information are encoded in the timing of auditory-nerve spikes, but they're carried differently. ENV through changes in firing rate, TFS through phase-locking to individual cycles of the waveform.
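
The envelope / fine-structure split described above is usually computed from the analytic signal (the Hilbert decomposition). Below is a minimal, dependency-free sketch for a single narrowband signal. The naive DFT is there only to keep the example self-contained (in practice a library routine such as `scipy.signal.hilbert` would be the usual choice), and the function names and test signal are our own illustrative assumptions.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive O(N^2) DFT (short demos only)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC and Nyquist, double the positive frequencies, zero the negative ones.
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2
        elif k > n / 2:
            spectrum[k] = 0
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_env_tfs(x):
    """Return (envelope, fine_structure) such that x[t] ~= envelope[t] * fine_structure[t]."""
    z = analytic_signal(x)
    envelope = [abs(v) for v in z]
    fine_structure = [math.cos(cmath.phase(v)) for v in z]
    return envelope, fine_structure


if __name__ == "__main__":
    n = 256
    # Toy narrowband signal: a fast carrier (32 cycles) under a slow envelope (2 cycles).
    signal = [
        (1 + 0.5 * math.cos(2 * math.pi * 2 * t / n)) * math.cos(2 * math.pi * 32 * t / n)
        for t in range(n)
    ]
    env, tfs = split_env_tfs(signal)
    print("envelope range:", round(min(env), 3), "to", round(max(env), 3))
```

For this toy signal the recovered envelope is the slow 2-cycle modulation and the fine structure is the unit-amplitude 32-cycle carrier. A real auditory model would first split the sound into cochlear-like frequency bands and apply this decomposition per band.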

The TFS framework has been extensively developed in the auditory science literature over the past two decades. Moore (2008, J. Assoc. Res. Otolaryngol. 9: 399-406) is the standard review of TFS's role in pitch perception, masking, and speech perception. Smith, Delgutte & Oxenham (2002, Nature 416: 87-90) used "chimaeric" sounds, constructed by combining the envelope of one signal with the TFS of another, to demonstrate that listeners rely on TFS for pitch and source localization while relying on ENV for speech recognition in quiet. Subsequent work (Lorenzi et al. 2006, PNAS 103: 18866-18869; Hopkins & Moore 2009, J. Acoust. Soc. Am. 125: 442-446) has shown that TFS sensitivity is critical for speech perception in noisy environments, and that hearing-impaired listeners' reduced sensitivity to TFS is a major factor in their difficulty understanding speech in noise.

This matters for NOMN in two ways.

First, TFS is the established technical vocabulary for what NOMN operates on. The temporal microstructure NOMN introduces is, in the technical language of the field, modulation of the temporal fine structure of the audio signal. We aren't making up a new perceptual category. We're operating in a well-mapped region of the auditory science literature.

Second, the existing TFS research focuses primarily on what's *lost*. How hearing-impaired listeners lose TFS sensitivity, how cochlear implants struggle to deliver TFS information, how aging degrades TFS processing. NOMN approaches the question from the other direction: what kind of TFS structure should well-engineered playback technology present to listeners whose TFS processing is intact? The auditory science community has spent two decades documenting how much TFS matters for normal hearing. The audio industry has not yet drawn the corresponding conclusion about playback technology design. NOMN is one application of that conclusion.

A note on scope. The "fine structure" in TFS refers to the rapid carrier oscillation within auditory filter bands, which is encoded at sub-millisecond resolution via phase-locking up to several kilohertz. NOMN's modulation operates across a range from microsecond to millisecond scales, modulating the temporal structure of the audio content itself. Both sit in the temporal regime where the auditory system does fine-grained timing work. We use the broader phrase "temporal microstructure" in marketing copy to avoid claiming we directly manipulate the specific signal-processing quantity that TFS researchers technically measure with the Hilbert decomposition, but the perceptual mechanism we're targeting is the same one that TFS research has been documenting since the early 2000s.

A note on what we are not claiming. We are not claiming that digital audio is missing temporal fine structure, or that NOMN restores something the format lost. A PCM recording carries fine structure for the in-band, adequately-resolved content of the signal. NOMN's claim is narrower and different: clock-locked playback presents whatever fine structure is there with perfect temporal stationarity, a stationarity no natural acoustic source has. NOMN introduces structured time-variation into the playback. It is agnostic to how much fine-structure detail the source file contains, because it modulates the temporal behavior of the signal rather than adding detail back to it.
If sub-JND timing differences don't matter, why does the audio industry spend so much effort minimizing latency?
They do matter, and this is the cleanest illustration of the human:machine framing we just made.

Every working musician who records with a DAW tunes their audio buffer size to keep round-trip latency as low as possible. Professional audio interfaces compete on sub-millisecond round-trip latency. The Bela platform was specifically built to achieve sub-millisecond action-to-sound latency for digital musical instruments (McPherson, Jack & Moro 2016, Proc. NIME) because most common platforms fail to meet the targets professional musicians need.

The peer-reviewed evidence on what musicians actually feel is clear. Jack, Mehrabi, Stockman & McPherson (2018, Music Perception 36: 109-128) tested professional percussionists and amateur musicians on a digital percussion instrument with controlled latency conditions of 0ms, 10ms, 10ms ± 3ms jitter, and 20ms. Both groups rated zero-latency as significantly higher quality than the 10ms-with-jitter and 20ms conditions. Professional percussionists were more sensitive to latency than amateurs and showed measurable changes in timing performance under added latency. Schmid et al. (2024, Proc. Mensch und Computer, ACM) measured just-noticeable-difference for added audio latency across 37 listeners and found a mean JND of 27ms at 64ms base latency, with musically sophisticated participants reliably detecting smaller margins. Earlier ensemble work documented that asynchronies up to 50ms occur in real performances (Rasch 1979, Acustica 43: 121-131) and that professional percussionists exhibit timing jitter of 10-40ms even when synchronizing to a metronome (Dahl 2011, Music Perception 28: 491-503).

Acoustic drums have a natural latency of about 2-3ms from stick contact to sound reaching the drummer's ears, a value set by the speed of sound and the distance from the drumhead to the drummer's ears. This is the baseline the drummer's nervous system has calibrated to over years of practice. When an electronic drum module introduces an extra 5-10ms on top of this, professional drummers describe the kit as "sluggish," "disconnected," "laggy."
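
The arithmetic behind these figures is simple enough to check. The sketch below assumes a speed of sound of 343 m/s and drumhead-to-ear distances of 0.7 to 1.0 m; the buffer sizes and the 48kHz sample rate are likewise illustrative assumptions, not measurements of any particular drum module.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C


def acoustic_delay_ms(distance_m, c=SPEED_OF_SOUND_M_S):
    """Time for sound to travel distance_m, in milliseconds."""
    return distance_m / c * 1000.0


def buffer_latency_ms(buffer_samples, sample_rate_hz):
    """Delay contributed by one audio buffer of the given size, in milliseconds."""
    return buffer_samples / sample_rate_hz * 1000.0


if __name__ == "__main__":
    for distance in (0.7, 0.85, 1.0):
        print(f"{distance:.2f} m of air -> {acoustic_delay_ms(distance):.2f} ms")
    for size in (64, 256, 512):
        print(f"{size} samples at 48 kHz -> {buffer_latency_ms(size, 48_000):.2f} ms")
```

A single 256-sample buffer at 48kHz already adds about 5.3ms, roughly double the acoustic baseline, which is why buffer size is the first thing performers tune.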

Notice what's happening here. The audio industry has, for decades, accepted the principle that **playback technology should operate at the temporal resolution the sensory system actually uses, not at the resolution of conscious A/B detection**. Nobody argues that audio interfaces should target 50ms latency because that's the conscious JND. The industry targets sub-millisecond because that's where the human:machine interaction breaks down. Studios record at high sample rates so that the production chain isn't the bottleneck. Word clocks are spec'd at jitter levels below classical audibility for the same reason. You don't want the clock to be the lowest-resolution element in the system.

This is exactly the principle NOMN applies. Crystal-locked playback has temporal stability orders of magnitude tighter than any natural acoustic source. The sensory system that consumes the audio resolves timing at microsecond scales. The fact that listeners can't always consciously label what they're hearing in an A/B test doesn't mean the technology should operate below the sensory floor. It means the audio industry should treat temporal microstructure with the same engineering discipline it already applies to sample rate, bit depth, latency, and jitter.
But the speaker cone and the room introduce way more temporal modification than NOMN does. Doesn't that swamp the effect?
In absolute time-magnitude terms, yes. A room's impulse response operates at the millisecond-to-hundreds-of-milliseconds scale. Speaker cone breakup happens at sub-millisecond scales. The acoustic chain introduces more temporal modification than NOMN does.

The relevant difference isn't magnitude. It's structure.

Room and speaker convolution is content-blind and stationary. The room's impulse response is fixed for a given listening position. The reverb tail of a snare hit and the reverb tail of a sustained vocal note get the same room treatment. This is convolution with a fixed kernel, large in magnitude, but content-blind and time-invariant.
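
"Content-blind and time-invariant" has a precise meaning that a few lines of code can demonstrate: with a fixed kernel, delaying the input only delays the output, and processing a mix equals mixing the processed parts. The impulse response and test signals below are made-up toy values, not measurements of any room.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution, returning the full-length result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def delay(signal, n):
    """Delay a signal by n samples by prepending zeros."""
    return [0.0] * n + list(signal)


# Toy values for illustration only.
room = [1.0, 0.0, 0.6, 0.0, 0.0, 0.3, 0.1]  # fixed impulse response
snare = [1.0, -0.5, 0.25, 0.0, 0.0]
vocal = [0.2, 0.4, 0.6, 0.4, 0.2]

# Time invariance: the room treats a late snare hit exactly like an early one.
time_invariant = convolve(delay(snare, 4), room) == delay(convolve(snare, room), 4)

# Linearity: the room treats the mix as the sum of its separately treated parts.
mix = [a + b for a, b in zip(snare, vocal)]
separate = [a + b for a, b in zip(convolve(snare, room), convolve(vocal, room))]
linear = all(abs(a - b) < 1e-12 for a, b in zip(convolve(mix, room), separate))

print(time_invariant, linear)
```

Both checks hold for any fixed kernel, which is the sense in which the room applies the same treatment regardless of what is playing or when.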

The auditory system has well-documented machinery for separating direct-path source signals from reverberant reflections. The foundational finding is the precedence effect, first systematically described by Wallach, Newman & Rosenzweig (1949, American Journal of Psychology 62: 315-336). When two identical sounds arrive at the ears within a few milliseconds of each other, the listener perceives a single fused sound localized at the position of the first-arriving wavefront, with the later-arriving reflections strongly suppressed in their contribution to perceived location. This is why you can localize a speaker in a reverberant room. The brain attributes the spatial cue to the direct sound and treats the reflections as environment. The mechanism extends into the broader framework of auditory scene analysis (Bregman, 1990, MIT Press), in which the auditory system uses primitive grouping cues to organize incoming sound into source representations distinct from environmental context. Subsequent reviews (Litovsky et al. 1999, J. Acoust. Soc. Am. 106: 1633-1654; Brown et al. 2015, J. Acoust. Soc. Am. 137: 776-790) document this is a continuous, automatic process operating below conscious awareness.

What the auditory system *can't* factor out, and uses heavily for source identification and naturalness judgment, is the underlying source's intrinsic timing structure. The room can smear what's there. It can't add what isn't, and it can't subtract what is.

Put simply: a real violin and a sampled violin played through the same speaker in the same room are typically distinguished by listeners on extended listening. The acoustic chain is identical. The difference is in source-level temporal structure that survives the chain because it's encoded in the signal before it ever reaches the speaker.
Doesn't the DAC's reconstruction filter smooth out fast timing modulation anyway?
No, and the reason matters for what we are doing here: NOMN's modulation isn't a separate timing channel for the DAC to filter out. The modulation is encoded in the audio content itself, in which samples contain what energy. The DAC sees a normal audio signal at its native sample rate and applies its usual reconstruction. Whatever filtering the DAC does to the audio, it does identically to NOMN-processed audio and unprocessed audio. The modulation is preserved because it's a property of the content, not metadata that the filter could destroy.

A general principle worth stating clearly: NOMN's modulation is content, not metadata. Anything that processes the audio processes the modulation along with it. Anything that doesn't process the audio can't touch the modulation. There's no separate channel to attack. The same logic applies to the speaker, the room, the listener's HRTF, the ear canal: all are, to a good approximation, linear time-invariant operations applied to the audio content, and none of them selectively erases the modulation.
Couldn't you accomplish the same thing with a low-depth chorus or some filtered noise driving varispeed?
You could accomplish *some* of it. Audio engineers have known for decades that adding subtle temporal variation makes digital audio sound less mechanical. Tape emulation plugins, subtle chorus, and pitch modulation are all in this lineage. We're not denying that any structured temporal variation is better than none.

The difference is in what the auditory system does with different kinds of variation. LFO-driven modulation is periodic, and the auditory system detects periodicity below conscious awareness. Subtle periodic modulation reads as "wobbly" or "effected" even when listeners can't say why. Filtered noise modulation is aperiodic but content-blind, which the auditory system also reads as foreign to natural sources, since natural sources don't produce statistically white timing variation. Natural timing variation has specific structure: long-range correlations and content correlation that have been measured directly in human performance. Hennig (2014, PNAS 111: 12974-12979) documented that timing deviations in professional drum performances exhibit long-range (1/f-type) correlations rather than white-noise statistics, a finding consistent with broader work on temporal structure in human motor performance (Gilden, Thornton & Mallon 1995, Science 267: 1837-1839). The closer your modulation matches this structure, the less the auditory system flags it as alien.
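
The statistical distinction between white and 1/f-type timing deviations can be made concrete with a small experiment. The sketch below generates one sequence of each kind and compares their autocorrelation at a lag of four events. It uses the Voss-McCartney algorithm as a stand-in generator for approximately 1/f noise; this is our illustrative choice, not the analysis method of the cited papers and not NOMN's model, and all names, seeds, and parameters are assumptions.

```python
import random


def white_deviations(n, rng):
    """Independent Gaussian timing deviations (no memory between events)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def pink_deviations(n, rng, rows=8):
    """Approximate 1/f deviations using the Voss-McCartney algorithm.

    Sums several random sources, where source k is refreshed every 2**k events,
    so slow sources contribute correlations over long spans.
    """
    sources = [0.0] * rows
    out = []
    for i in range(n):
        for k in range(rows):
            if i % (2 ** k) == 0:
                sources[k] = rng.gauss(0.0, 1.0)
        out.append(sum(sources))
    return out


def autocorrelation(x, lag):
    """Normalized sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x)
    covariance = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return covariance / variance


if __name__ == "__main__":
    rng = random.Random(2026)
    white = white_deviations(8192, rng)
    pink = pink_deviations(8192, rng)
    print("white, lag 4:", round(autocorrelation(white, 4), 3))
    print("pink,  lag 4:", round(autocorrelation(pink, 4), 3))
```

The white sequence shows autocorrelation near zero, while the 1/f-like sequence stays substantially correlated several events later. This captures only the long-range correlation part of the structure described above, not the content correlation.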

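NOMN's modulation matches that structure. A low-depth chorus doesn't, and neither does content-blind noise, even 1/f-shaped noise, because it carries the long-range correlations without the content correlation.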

There is a subtler version of this question worth answering directly. Any sufficiently fast time-axis modulation alters the temporal fine structure of the signal, regardless of what control signal drives it. That much is just the physics of the operation, and it is true of an LFO, of 1/f noise, and of NOMN. But altering TFS is not automatically beneficial. The auditory system distinguishes between TFS variation that matches natural-source statistics and TFS variation that doesn't. Periodic modulation reads as effect. White-noise modulation reads as malfunction. Only modulation carrying the statistical structure of natural temporal variation reads as natural. The varispeed engine is the mechanism. The control signal is what determines whether the resulting TFS modification is something the auditory system welcomes or something it flags. The mechanism is generic. The structure is not.
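
To make "the mechanism is generic" concrete, here is a minimal sketch of a generic varispeed-style engine: it reads the input at a time-varying fractional position and accepts any control signal. This is not NOMN's algorithm. The function name is hypothetical, the LFO parameters are arbitrary, and linear interpolation is a deliberately crude choice (a production resampler would use a higher-order or windowed-sinc interpolator).

```python
import math


def varispeed(samples, offsets):
    """Read samples at position n + offsets[n] using linear interpolation.

    offsets is the control signal, in samples (fractional values allowed).
    Read positions outside the buffer are clamped to its edges.
    """
    last = len(samples) - 1
    out = []
    for n, offset in enumerate(offsets):
        position = min(max(n + offset, 0.0), float(last))
        index = int(position)
        frac = position - index
        following = samples[min(index + 1, last)]
        out.append((1.0 - frac) * samples[index] + frac * following)
    return out


if __name__ == "__main__":
    fs = 48_000
    tone = [math.sin(2 * math.pi * 1_000 * n / fs) for n in range(fs // 10)]
    # Example control signal: a 5 Hz LFO with a depth of a quarter sample.
    lfo = [0.25 * math.sin(2 * math.pi * 5 * n / fs) for n in range(len(tone))]
    warped = varispeed(tone, lfo)
    print("largest sample change:", max(abs(a - b) for a, b in zip(tone, warped)))
```

The engine never inspects what kind of control signal it receives: swapping the LFO for noise, or for anything else, changes only the `offsets` list. The output is an ordinary list of PCM samples in every case.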
Hasn't this been tried before? Isn't NOMN just like MQA or C Wave?
Our intentions are quite different, and our claims are far less extreme. MQA tried to fix time-domain artifacts in the encoding/decoding chain itself, marketed lossy compression as lossless, required proprietary decoders, and treated independent measurement as an adversary. It collapsed under sustained technical critique. NOMN doesn't touch the encoding chain. We add temporal microstructure at playback, downstream of decoding, with conventional architecture. We think it would be great if NOMN were integrated into hardware and streaming clients, and if some mastering engineers found it compelling enough to use as a final touch.

C Wave argues that PCM is "non-continuous" and that the brain detects this discontinuity. Their solution is a kind of reverb to "fill in gaps." We don't share that diagnosis. A reverb algorithm running on PCM is still PCM, and the Nyquist-Shannon sampling theorem guarantees that properly bandlimited PCM is mathematically equivalent to a continuous waveform up to the Nyquist frequency. There are no gaps to fill in the digital signal. We're not claiming to fix something inside PCM. We're claiming that natural acoustic sources have temporal microstructure that crystal-locked playback lacks, which is a different claim, one grounded in the physical properties of natural sound sources rather than in disputed claims about sampling theory.
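
The "no gaps" point can be checked numerically with Whittaker-Shannon (sinc) interpolation, which recovers the value of a bandlimited signal at any instant between samples. A minimal sketch, assuming a 1kHz sine sampled at 48kHz and a finite window of 4001 samples. An ideal reconstruction sums over infinitely many samples, so a small truncation error remains here.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def reconstruct(samples, fs, t):
    """Whittaker-Shannon interpolation: value of the bandlimited signal at time t (seconds)."""
    return sum(s * sinc(t * fs - n) for n, s in enumerate(samples))


fs = 48_000.0  # sample rate (assumed for illustration)
f0 = 1_000.0   # test tone, well below the 24 kHz Nyquist frequency
samples = [math.sin(2 * math.pi * f0 * n / fs) for n in range(4001)]

# Evaluate exactly halfway between two samples near the middle of the window.
t = (2000 + 0.5) / fs
estimate = reconstruct(samples, fs, t)
truth = math.sin(2 * math.pi * f0 * t)

print("reconstructed:", estimate)
print("true value:   ", truth)
```

The reconstructed value between the two samples agrees with the true sine to within the truncation error of the finite window, and that error shrinks as the window grows.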

The lessons we take from those efforts: don't pick fights with sampling theory, don't claim what you can't measure, and don't treat independent measurement as an enemy.
How is this different from a humanizer plugin?
Humanizer plugins use random number generators to add timing variation to MIDI events. They've existed since the early 1990s, and they help. That's why every DAW has one.

Two differences. First, humanizers add stochastic variation. NOMN adds structured variation matched to natural source statistics. Random isn't the same as natural. The long-range correlation structure documented in human motor timing (Gilden et al. 1995; Hennig 2014) is categorically different from the white-noise distribution most humanizers produce, and the auditory system responds to that distinction.

Second, humanizers operate on MIDI event timing before audio rendering. NOMN operates on audio at the signal level. A humanizer on a quantized MIDI snare moves the hit. NOMN modulates the playback of the audio itself. Different operations, different signal-chain positions, different effects. A humanizer can't humanize a finished audio file. NOMN can.
Is the temporal modulation audible?
This is the wrong question, and the way it's usually asked is part of why the audibility debate in audio has been unproductive for so long.

If you mean "can a listener identify NOMN as a recognizable effect," generally no, and that's the design intent. A flanger that wasn't audible would be failing at its purpose. NOMN that was audible as processing would be failing at its purpose. They're aiming at opposite outcomes.

If you mean "would a listener succeed at A/B-distinguishing NOMN-processed audio from unprocessed audio in a controlled trial," that's an empirical question we intend to investigate with proper, independent, pre-registered perceptual research, and we will publish the results. It's also not the question that decides whether the technology matters or is worth pursuing or supporting.

The relevant question is the one the audio industry has been answering for decades on every other dimension of the playback chain: does the technology operate at the temporal resolution the sensory system actually uses? For sample rate, bit depth, latency, jitter, and frequency response, the industry has consistently answered yes. The production chain should match the sensory floor, not the conscious A/B detection threshold. We're applying the same engineering discipline to temporal microstructure. Whether a listener can articulate the difference in a forced-choice test on a per-track basis is a different question from whether the technology serving billions of hours of human listening should match the sensory resolution.
Why is it called NOMN? Is the last N a silent N?
The name derives from "metronome." It also reads as "no man." We treat that dual resonance, between mechanical timekeeping and human-produced variation, as productive rather than something to resolve.
Where can I read more about the auditory science you're citing?
The claims in this FAQ about how the auditory system processes time aren't ours. They're standard neuroscience, and we've cited the canonical sources so anyone can verify what we're working from. The main sources:

INTERAURAL TIME DIFFERENCE THRESHOLDS

— Klumpp, R.G. & Eady, H.R. (1956). "Some Measurements of Interaural Time Difference Thresholds." Journal of the Acoustical Society of America 28(5): 859-860. The original measurement: 9μs threshold for band-limited noise, 11μs for 1000-Hz tone, 28μs for clicks (75% correct discrimination, ten listeners).

— Mills, A.W. (1958). "On the Minimum Audible Angle." Journal of the Acoustical Society of America 30(4): 237-246. Foundational measurement of angular acuity in sound localization (~1° near midline).

— Brughera, A., Dunai, L. & Hartmann, W.M. (2013). "Human interaural time difference thresholds for sine tones: The high-frequency limit." Journal of the Acoustical Society of America 133(5): 2839-2855. Modern confirmation of ~10μs thresholds for pure tones at mid-frequencies, with high-frequency cutoff around 1.4 kHz.

NEURAL CODING OF TEMPORAL STRUCTURE

— Joris, P.X., Schreiner, C.E. & Rees, A. (2004). "Neural Processing of Amplitude-Modulated Sounds." Physiological Reviews 84(2): 541-577. The standard review on how the auditory system encodes temporal modulation for source localization, identification, and parsing.

— Moore, B.C.J. (2008). "The role of temporal fine structure processing in pitch perception, masking, and speech perception for normal-hearing and hearing-impaired people." Journal of the Association for Research in Otolaryngology 9(4): 399-406. The canonical review of temporal fine structure (TFS) and its perceptual role.

— Smith, Z.M., Delgutte, B. & Oxenham, A.J. (2002). "Chimaeric sounds reveal dichotomies in auditory perception." Nature 416: 87-90. The foundational experimental demonstration that listeners rely on TFS for pitch and localization while ENV dominates speech recognition in quiet.

— Lorenzi, C., Gilbert, G., Carn, H., Garnier, S. & Moore, B.C.J. (2006). "Speech perception problems of the hearing impaired reflect inability to use temporal fine structure." Proceedings of the National Academy of Sciences 103: 18866-18869. Direct evidence for TFS's role in speech-in-noise perception.

SOURCE/ENVIRONMENT SEPARATION

— Wallach, H., Newman, E.B. & Rosenzweig, M.R. (1949). "The Precedence Effect in Sound Localization." American Journal of Psychology 62(3): 315-336. The foundational paper showing that listeners localize sounds based on first-arriving wavefront, suppressing reverberant reflections.

— Bregman, A.S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press. The standard reference text on how the auditory system organizes complex sound mixtures into source representations.

— Litovsky, R.Y., Colburn, H.S., Yost, W.A. & Guzman, S.J. (1999). "The Precedence Effect." Journal of the Acoustical Society of America 106(4): 1633-1654. Comprehensive review of the precedence effect and echo suppression literature.

LATENCY PERCEPTION AND MUSICAL PERFORMANCE

— Jack, R.H., Mehrabi, A., Stockman, T. & McPherson, A. (2018). "Action-sound Latency and the Perceived Quality of Digital Musical Instruments." Music Perception 36(1): 109-128. Professional percussionists rated 10ms±3ms jitter and 20ms latency conditions as significantly lower quality than zero latency.

— McPherson, A., Jack, R. & Moro, G. (2016). "Action-Sound Latency: Are Our Tools Fast Enough?" Proc. NIME 2016. Survey demonstrating most digital musical instrument platforms fail to meet sub-millisecond latency targets; motivates the Bela platform.

— Schmid, A., et al. (2024). "Measuring the Just Noticeable Difference for Audio Latency." Proc. Mensch und Computer 2024 (ACM). Mean JND of 27ms at 64ms base latency, with musically sophisticated listeners detecting smaller margins.

— Dahl, S. (2011). "Striking Movements: A Survey of Motion Analysis of Percussionists." Music Perception 28(5): 491-503. Documentation of percussionist timing variability.

NATURAL TIMING STATISTICS

— Hennig, H. (2014). "Synchronization in human musical rhythms and mutually interacting complex systems." Proceedings of the National Academy of Sciences 111(36): 12974-12979. Direct measurement of 1/f long-range correlations in professional drum performance timing.

— Gilden, D.L., Thornton, T. & Mallon, M.W. (1995). "1/f noise in human cognition." Science 267: 1837-1839. Broader finding of 1/f temporal structure across human cognitive and motor performance.

We cite this work because we want NOMN's perceptual claims to rest on the same foundation as the rest of the auditory science community's. Independent measurement and verification are how this field moves forward, and we don't want to be exempt from that.

Hearing is the fastest human sense, by more than 10x. Listeners can detect timing differences between the two ears of around ten microseconds. If the monitor you're reading this on refreshes at 60Hz, it draws a new frame every 16.7 milliseconds, an interval over a thousand times longer than what your ears can resolve.

Every digital audio source on earth shares one property: timing far more stable than anything in nature. DAWs, digital synthesizers, drum machines, samplers, streaming audio — all of it is temporally rigid by design. Audiophiles chase ever-tighter stability with 10MHz external clocks. The working definition of "fidelity" has become minimal frequency instability, minimal timing variation.

In parallel, the industry spent fifty years optimizing spectral fidelity, building a digital infrastructure for music creation and listening that operates orders of magnitude below the temporal sensitivity of the system it should serve: the listener.

Sound in nature is never temporally rigid. Every acoustic instrument, every voice, every bit of wind through an environment exhibits continuous microsecond-scale timing variation arising from the physics of its production. These variations are not imperfections; they are part of what the auditory system recognizes as aliveness.

The keystone of every audio technology is an underlying periodicity, or clock. Whether it is a modulated electrical frequency, a spinning wax cylinder, a record lathe, or a digital-to-analog converter, there is always a mechanism that divides the signal into units and keeps those units in order throughout the system. If that clock degrades, the illusion falls apart: like a flipbook turned too slowly, the perceptual hack fails.

Record players and analog tape machines don't sound better — they feel better. They are microtiming enhancers that, by accident, introduce random temporal variation into the signal. The mechanical instabilities of a turntable or tape transport introduce variation in the time domain coupled with frequency instability. This is a quality people spend enormous sums chasing through vinyl pressings, vacuum tubes, and analog signal chains — often without being able to name what they're hearing, because what they're hearing is not spectral. It's temporal.

NOMN introduces temporal life to digital audio. It is a temporal fine structure enhancement system that adds human-structured, non-repeating timing variation to any audio stream, operating at the resolution of the human perceptual system.

In the early twentieth century, the artist Marcel Duchamp coined "inframince" — the infrathin — for the separative difference between things that appear identical. Two objects from the same mold, identical yet not.

NOMN takes the infrathin separative difference between living and mechanical time and makes it operable.

--
## How It Works

NOMN is built on a generative model of organic temporal behavior derived from eighty spoken languages. At runtime, the system produces a continuous stream of timing variations, over 1,000 updates per second, and applies them to incoming audio. The original content is preserved entirely: nothing is added to or removed from the signal. Only the temporal microstructure is enriched, at a scale far finer than swing or groove but still large enough to have a perceptual effect.

The variations are not random and cannot be reproduced by adding jitter. They are not periodic. They do not loop. They are contextually structured and non-repeating, generated live for every moment of audio that passes through.

NOMN does not claim that digital audio is missing temporal fine structure, or that it restores something the format lost. A digital recording carries fine structure for the in-band content of the signal. NOMN's premise is different: clock-locked playback presents whatever is there with near-perfect temporal stationarity, a stationarity no natural acoustic source has. NOMN introduces structured time-variation into the playback.


## Use Cases

### Mastering & Post-Production
A new dimension of audio enhancement orthogonal to EQ, compression, spatial processing, and loudness. Applicable to any master, any genre, any era of recording.

### Streaming & Playback
Deployable as a real-time processing layer in streaming infrastructure or playback devices. Enhances any audio passing through (music, podcasts, film audio) without content modification.

### Hardware Integration
The system's compute footprint is small enough for embedded deployment on the audio DSP chips in earbuds, automotive head units, and portable players. Licensable for integration into consumer audio hardware, automotive audio systems, and professional equipment.

--
## What It Isn't

NOMN is not an equalizer, a compressor, a spatial processor, or an effect. It does not modify frequency content, dynamic range, stereo image, or loudness. It does not add harmonics, noise, or saturation. The modification is in the time domain.

--
## Technical Notes

NOMN's timing variations operate at microsecond-to-millisecond scales: the same order of magnitude as the timing instabilities in analog playback systems, and finer. Unlike those instabilities, they are structured rather than mechanical, and non-repeating rather than periodic.
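To put those scales in terms of samples, here is a small arithmetic sketch. The three durations and the 96kHz rate are illustrative picks, not NOMN parameters.

```python
def seconds_to_samples(duration_s, sample_rate_hz):
    """How many sample periods a time offset spans at a given rate."""
    return duration_s * sample_rate_hz

for label, t in [("10 µs", 10e-6), ("100 µs", 100e-6), ("1 ms", 1e-3)]:
    print(f"{label} at 96 kHz = {seconds_to_samples(t, 96_000):.2f} samples")
# 10 µs at 96 kHz = 0.96 samples
# 100 µs at 96 kHz = 9.60 samples
# 1 ms at 96 kHz = 96.00 samples
```

A shift at the microsecond end of the range is smaller than one sample period, so it has to be rendered by interpolating between samples rather than by moving whole samples.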

The system includes continuous quality validation that monitors the relationship between intended and rendered timing, helping ensure the enhancement survives the full signal chain from processing through output. Null test analysis shows no added harmonics, noise, EQ, or spatial processing — the difference between input and output is in the time domain.
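For readers unfamiliar with null tests, the basic measurement can be sketched as follows: subtract the input from the output and express the residual's level relative to the input. This is a generic illustration, not NOMN's own analysis tooling; the function names and the fixed 10 µs test delay are assumptions.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def null_test_db(reference, processed):
    """Level of (processed - reference) relative to the reference, in dB."""
    residual = [p - r for r, p in zip(reference, processed)]
    residual_level = rms(residual)
    if residual_level == 0:
        return float("-inf")           # perfect null: the signals are identical
    return 20 * math.log10(residual_level / rms(reference))

fs = 96_000
shift_s = 10e-6  # a fixed 10 microsecond delay, used only as a test signal
reference = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(960)]
shifted = [math.sin(2 * math.pi * 1000 * (n / fs - shift_s)) for n in range(960)]
print(f"{null_test_db(reference, shifted):.1f} dB")  # roughly -24 dB
```

The level alone says only how much the two signals differ. Deciding whether the residual contains harmonics, noise, or EQ changes requires looking at its spectrum as well.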

--
## Formats & Access

API: RESTful HTTP endpoint. Send audio, receive processed audio. Optional control parameters. Automatic mode available.
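A request to an endpoint of this kind might be assembled as in the sketch below. The URL, the `mode` parameter, the query-string placement of parameters, and the `audio/wav` content type are all hypothetical; none of them is specified in this document. The code only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: the real URL is not given in this document.
ENDPOINT = "https://api.example.com/v1/process"

def build_process_request(audio_bytes, params=None, content_type="audio/wav"):
    """Build (but do not send) a POST request carrying audio in the body."""
    url = ENDPOINT
    if params:
        # Assumption: optional control parameters travel in the query string.
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )

# Placeholder bytes, not a valid audio file.
request = build_process_request(b"RIFF....WAVE", params={"mode": "auto"})
# Sending it with urllib.request.urlopen(request) would return the response,
# whose body would be the processed audio.
```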

Licensing: Available for integration into hardware, software, and streaming infrastructure. Per-device, per-track, or enterprise licensing models.

Patent Status: Patent pending (Japan, 2026). POLYTOPE KK.

--
## What is digital audio and why is it so confusing?

There is something essentially confusing about digital audio, something less intuitive than, say, individual pixels forming an image on a screen. We have all been confused, and you will find very divergent understandings online, in discussion forums and audiophile communities. We think of all audio as a kind of perceptual parlor trick that works strangely well, and that offers something even more powerful than realism: fantasy. Underneath that power are a whole lot of numbers that somehow end up pushing compression waves towards your body in a way believable enough that our small human brains accept it as almost real, and we find connection through it.

Digital audio works on the encoding side — recording, or making an audio file — by taking very fast measurements of a continuously varying signal and storing them as a sequence of numbers.

### The smallest units of digital audio and their qualities

A sample is a single one of those measurements: an integer (no decimal) or a float (decimal) representing the instantaneous amplitude of the waveform at one moment. Contrary to how it's described in analog-centric communities, these are best understood not as "zeros and ones" but as a rapid-fire graphing of a compression wave over time.

The sample rate determines how often those measurements happen. It's expressed in samples per second. At 96kHz the system captures 96,000 amplitude values every second, each one a snapshot of where the waveform is right then. At 44.1kHz (CD) it's 44,100 per second.

A common intuition is that the file "contains nothing" between samples, but this is the single most misleading way to think about it. A properly bandlimited sampled signal is a complete representation of the original waveform up to the Nyquist frequency. There is no missing information between the samples. The DAC's reconstruction filter doesn't guess or fill a gap; it reconstructs the one continuous waveform the samples uniquely describe. Higher sample rates like 192kHz or DSD don't add information the ear was missing; they move the reconstruction filter's work further away from the audible band.

DSD (Direct Stream Digital) is a family of rates running from 2.8224 MHz at the base level (DSD64) up to 22.5792 MHz at DSD512. DSD is a 1-bit format: each sample is a single bit, and the density of ones over a short stretch tracks the signal's amplitude, with the very high clock rate compensating for the low bit depth through noise shaping.
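The point that nothing is missing between samples can be checked numerically. The sketch below uses Whittaker-Shannon (sinc) interpolation to evaluate a sampled 15kHz tone at an instant halfway between two samples and compares the result with the true waveform. The tone, the window length, and the function names are illustrative choices; a real DAC approximates this with a finite filter rather than computing the sum directly.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, sample_rate, t):
    """Value of the bandlimited waveform at time t (seconds), computed from
    the samples alone via Whittaker-Shannon interpolation."""
    pos = t * sample_rate
    return sum(s * sinc(pos - n) for n, s in enumerate(samples))

fs = 44_100
f = 15_000.0                                   # below Nyquist (22.05 kHz)
samples = [math.sin(2 * math.pi * f * n / fs) for n in range(4001)]

t_mid = 2000.5 / fs                            # halfway between two samples
estimate = reconstruct(samples, fs, t_mid)
truth = math.sin(2 * math.pi * f * t_mid)
error = abs(estimate - truth)
print(f"reconstructed {estimate:+.4f}, true {truth:+.4f}")
```

The small remaining error comes from using a finite window of samples; it shrinks as more samples on either side are included.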

The samples themselves contain nothing about frequency, timbre, or pitch. This is what's so confusing. There is no analysis happening beyond an amplitude value within these tiny moments. Those properties emerge from the pattern across many samples. A speaker cone only needs to know where to be at each moment, and a sequence of "where to be" values is enough to trace out any waveform. The speaker is displacing air with these movements and this displacement results in compression waves your body can sense.

The Nyquist limit is another confusing term that often comes up in conversations about audio quality. It describes a practical consequence of sampling at a finite rate: to capture a wave that wiggles at frequency F, you need to take more than 2F samples per second. Sample any slower and there aren't enough points per cycle to reconstruct the wave unambiguously; it becomes indistinguishable from a lower-frequency wave. So if you imagine a wave flying by, you need to touch it at enough points that someone, or in this case a machine, could understand its size by capturing at least its high and low points.
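The ambiguity can be shown directly. In this sketch a 30kHz cosine, which is above the Nyquist frequency for 44.1kHz sampling, produces exactly the same sample values as a 14.1kHz cosine. The frequencies are illustrative.

```python
import math

def sample_tone(freq_hz, sample_rate, n_samples):
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

fs = 44_100
above_nyquist = sample_tone(30_000, fs, 64)    # 30 kHz > fs / 2 = 22.05 kHz
alias = sample_tone(fs - 30_000, fs, 64)       # 14.1 kHz

max_diff = max(abs(a - b) for a, b in zip(above_nyquist, alias))
print(max_diff < 1e-9)  # True: the two sample sequences are indistinguishable
```

This is why converters filter out content above half the sample rate before sampling: once the samples are taken, the two tones cannot be told apart.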

Let's pause here. We are talking literally about the size of a wave in air. The highest frequencies humans can hear correspond to wavelengths roughly the width of a fingernail. Human hearing tops out near 20kHz, which is why 44.1kHz and 48kHz became standard. Both leave a comfortable margin above the audible band. Higher rates like 96kHz or 192kHz don't extend what you can hear. They give the analog reconstruction filters at the DAC more room to operate cleanly in the audible range.

44.1kHz sampling → 22.05kHz max frequency → 15.6 mm wavelength
48kHz sampling → 24kHz max → 14.3 mm
96kHz sampling → 48kHz max → 7.1 mm
192kHz sampling → 96kHz max → 3.6 mm
384kHz sampling → 192kHz max → 1.8 mm
768kHz sampling → 384kHz max → 0.89 mm
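The wavelengths in the table follow from wavelength = speed of sound / frequency. The sketch below reproduces them, assuming a speed of sound of 343 m/s (dry air at about 20 °C), which is the value the table's figures are consistent with.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 °C

def nyquist_wavelength_mm(sample_rate_hz):
    """Wavelength in air, in millimetres, of the highest representable frequency."""
    nyquist_hz = sample_rate_hz / 2
    return SPEED_OF_SOUND_M_S / nyquist_hz * 1000

for fs in (44_100, 48_000, 96_000, 192_000, 384_000, 768_000):
    print(f"{fs / 1000:g} kHz -> {fs / 2000:g} kHz max -> "
          f"{nyquist_wavelength_mm(fs):.2f} mm")
```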

DSD operates differently and isn't directly comparable on this table. Its raw clock rate at DSD512 is 22.5792 MHz, but that's a 1-bit modulator clock, not a PCM Nyquist limit, and the usable audio bandwidth is shaped by the noise-shaping filter rather than set by half the sample rate.

Bit depth is how precisely each measurement is stored. 24-bit gives ~16.7 million possible amplitude values per sample, which sets the dynamic range (the largest possible difference between loud and soft) and the noise floor. Keep in mind that bit depth also doesn't "know" anything; its effect is similarly rapid and aggregate. More bit depth doesn't mean your computer somehow knows how to render the sound of a bow touching a string.
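The relationship between bit depth, number of levels, and dynamic range can be sketched with the textbook formula for an ideal quantizer driven by a full-scale sine (about 6.02 dB per bit plus 1.76 dB). Real converters fall short of this ideal, so treat the figures as upper bounds.

```python
import math

def quantization_levels(bits):
    return 2 ** bits

def ideal_dynamic_range_db(bits):
    """Theoretical signal-to-quantization-noise ratio of an ideal N-bit
    quantizer for a full-scale sine wave."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

for bits in (16, 24):
    print(f"{bits}-bit: {quantization_levels(bits):,} levels, "
          f"{ideal_dynamic_range_db(bits):.1f} dB")
# 16-bit: 65,536 levels, 98.1 dB
# 24-bit: 16,777,216 levels, 146.3 dB
```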

### Buckets, chunks, and frames

A buffer is a small chunk of consecutive samples that the system processes as a group, because it would be wildly inefficient to hand off samples one at a time between software, drivers, and hardware. General-purpose computers still struggle to move audio at these speeds without loads of jitter. A typical buffer is a bucket of 64, 128, or 512 samples. At 96kHz, a 64-sample buffer represents about 0.67 milliseconds of audio. Smaller buffers mean lower latency (the time between a signal entering the system and leaving it), but they require more frequent processing, demand more from the CPU, and are more exposed to various weirdnesses and interference from the operating system, hardware, and firmware. Larger buffers are easier on the processor (ah, relaxing) but introduce noticeable delay, which matters for live performance and monitoring because humans are such incredible time-keepers.
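The buffer trade-off is simple arithmetic, sketched here for the sizes mentioned above. The function names are illustrative.

```python
def buffer_duration_ms(frames, sample_rate_hz):
    """How much audio one buffer holds, in milliseconds."""
    return frames / sample_rate_hz * 1000

def callbacks_per_second(frames, sample_rate_hz):
    """How often the system must deliver a fresh buffer."""
    return sample_rate_hz / frames

for size in (64, 128, 512):
    print(f"{size} frames at 96 kHz: {buffer_duration_ms(size, 96_000):.2f} ms, "
          f"{callbacks_per_second(size, 96_000):.1f} buffers per second")
# 64 frames at 96 kHz: 0.67 ms, 1500.0 buffers per second
# 128 frames at 96 kHz: 1.33 ms, 750.0 buffers per second
# 512 frames at 96 kHz: 5.33 ms, 187.5 buffers per second
```

A single buffer's duration is only a lower bound on latency; real systems usually queue more than one buffer and add driver and converter delay on top.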

When audio has multiple channels, stereo, surround, or more, each moment in time has one sample per channel, and the group of simultaneous samples across all channels is called a frame. A stereo recording at 96kHz produces 96,000 frames per second, each frame containing two samples, left and right. Buffer sizes are usually counted in frames rather than samples, since that's what corresponds to a duration of audio regardless of channel count.
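In memory and in many file formats, frames are commonly stored interleaved: left, right, left, right, and so on. A minimal sketch of converting between that layout and per-channel lists follows; the function names are illustrative.

```python
def deinterleave(interleaved, channels):
    """Split [L0, R0, L1, R1, ...] into one list per channel."""
    if len(interleaved) % channels:
        raise ValueError("sample count is not a whole number of frames")
    return [interleaved[c::channels] for c in range(channels)]

def interleave(channel_data):
    """Merge per-channel lists back into frame order."""
    return [sample for frame in zip(*channel_data) for sample in frame]

stereo = [0.1, -0.1, 0.2, -0.2, 0.3, -0.3]   # three frames, two channels
left, right = deinterleave(stereo, 2)
print(left)                                  # [0.1, 0.2, 0.3]
print(right)                                 # [-0.1, -0.2, -0.3]
print(len(stereo) // 2)                      # 3 frames
```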

### Back to air

So at some point, for a human to perceive any of this, we need to take this whole rapid-fire bucket-passing situation and turn it into air. The audio engine fills each buffer with frames, processes it, and hands it off to the DAC. The DAC then converts the numbers back into voltages that drive a speaker, moving the cone to the indicated positions as precisely as the speaker is capable of.

The whole cycle repeats thousands of times per second, fast enough that the listener perceives a continuous, seamless waveform rather than a sequence of discrete blocks.

--
## A Note on Subtlety

The effect is subtle by design. It is not a discrete change you hear like an EQ — it is a qualitative shift in how audio feels as a temporal experience. Audio has always functioned through the exploitation of the ear's temporal resolution: a clock fast enough to exceed perceptual discrimination produces the illusion of continuity. NOMN operates at this same threshold, not by degrading the clock but by giving it the kind of structured instability that acoustic and mechanical systems have always had and digital systems do not.

Whether this matters for a given listener, a given recording, a given playback chain is an empirical question, not a rhetorical one. We don't make claims about what you'll feel. But we feel it, and hope you will too.